How to select an objective function using information theory

In machine learning or scientific computing, model performance is measured with an objective function. But why choose one objective over another? Information theory gives one answer: To maximize the information in the model, select the most likely objective function or, equivalently, whichever represents the error in the fewest bits. To evaluate different objectives, transform them into likelihood functions. As likelihoods, their relative magnitudes represent how much we should prefer one objective versus another, and the log of their magnitudes represents the expected uncertainty of the model.


Introduction
Science tests competing theories or models by evaluating the similarity of their predictions against observational experience, favoring those that best fit the evidence. Thus, how scientists measure similarity fundamentally determines what they learn. In machine learning and scientific computing, an analogous process occurs when calibrating or evaluating a model. In that case, the model is "tuned" according to a similarity metric, known as an objective or loss function. A classic example is the mean squared error, which is the optimal measure of similarity when errors are normally distributed and independent and identically distributed (iid). But for many problems, the true error distribution is complex or unknown. Rather than simply assuming some de facto objective function, information theory should guide that choice.
The debate series led by Kumar and Gupta (2020) posits that information theory provides a new paradigm for Earth science. Here we give a basic, yet motivating, demonstration by using information theory to help solve a fundamental scientific question: How should we quantify similarity, or put another way, how should we quantify uncertainty?
This paper draws extensively from textbooks by Cover and Thomas (2006) and Burnham and Anderson (2002). To learn more about the deeper connection between information theory and probability theory, and maximum likelihood estimation in particular, refer to these or other textbooks. Within the Earth science literature, Weijs and Ruddell (2020) and Nearing et al. (2020) provide good background for this paper.
Two definitions are also helpful. The "model" is the knowledge or theory that explains or captures information shared among some variables. All models are approximations: they explain some things but not everything. The "uncertainty" represents the information that cannot be explained by the model. The uncertainty is also represented with a model (everything we know is), but this distinction is extremely useful. When calibrating a model, we optimize an objective function in order to minimize the model's uncertainty. This idea, that the objective function is a model for uncertainty, is fundamental. Like any model, we can use statistics to evaluate different objective functions.
Here, we demonstrate one approach using maximum likelihood estimation, which is relatively simple and well suited to large physical simulations and machine learning models that are used extensively within Earth science.

The Experiment
In the classic modeling experiment, a model is varied (or "tuned") while the test data and objective are held fixed. To select the "best" model, choose whichever model optimizes the objective function computed on the test data. If mean squared error (MSE) is the objective, compute the MSE between the test data and the model predictions, then select the model with the lowest MSE. To select the "best" objective, flip the classic experiment by varying the objective while the model and data are held fixed. Now, select the objective indicating the greatest similarity between the data and the model. Different objective functions have different scales, so they are normalized such that each integrates to one, thereby representing them as probability distributions. For example, the normalized form of MSE is the normal distribution (Hodson, 2022). When used to evaluate model fit, that distribution function is called a likelihood function and its output the likelihood. To select among objectives, compare their likelihoods and favor the most likely. Taking the natural logarithm of the likelihood, denoted as ℓ, does not change the model ranks but simplifies the math by converting products to sums: likelihoods multiply, so log-likelihoods add. But besides being easier to compute, ℓ also represents the expected uncertainty.
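To make the flipped experiment concrete, here is a minimal Python sketch (the synthetic data, error structure, and parameter estimates are our illustrative assumptions, not from the paper): the data and predictions stay fixed while two candidate objectives, MSE and MAE, are scored as normal and Laplace likelihoods.

```python
# Sketch: hold the model and data fixed, vary the objective, and compare
# log-likelihoods. All data here are synthetic and purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y = rng.lognormal(mean=1.0, sigma=1.0, size=10_000)          # "observations"
y_hat = y * rng.lognormal(mean=0.0, sigma=0.3, size=y.size)  # fixed "model" predictions
errors = y - y_hat

# MSE corresponds to a normal error model; MAE to a Laplace error model.
# Plug in the maximum likelihood estimate of each scale parameter.
sigma = errors.std()           # MLE scale of the normal
b = np.abs(errors).mean()      # MLE scale of the Laplace

ll_mse = stats.norm.logpdf(errors, scale=sigma).sum()
ll_mae = stats.laplace.logpdf(errors, scale=b).sum()

# Favor the objective with the greater log-likelihood.
print(f"normal (MSE): {ll_mse:.0f}   Laplace (MAE): {ll_mae:.0f}")
```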
Thus far, the problem is framed in terms of probability theory, but information theory gives an equally valid interpretation. The goal of the former is to find the most likely model, whereas the goal of the latter is to find the model giving the best compression (Cover & Thomas, 2006). The deeper connection between concepts like information, uncertainty, probability, similarity, objective functions, likelihoods, and data compression is briefly reviewed in the next section.

Uncertainty to Information
To maximize the information in the model, select the most likely objective function or whichever represents the error in the fewest bits. The explanation follows and summarizes Cover and Thomas (2006) and Burnham and Anderson (2002). Three fundamental concepts in information theory are (1) the entropy H(D), which is the expected information in each new observation of the data D; (2) the conditional entropy H(D|M), which is the additional information needed to represent D after encoding it with some model M (think of it as the information in the model error or the uncertainty); and their difference (3), known as mutual information, which measures how much information M encodes about D:

I(M; D) = H(D) - H(D|M). \quad (1)
When comparing models against the same data, H(D) is a constant C, so only the conditional entropy H(D|M) is informative. Now, the connection to probability theory: the entropy of a random variable X with probability density function f(x) is defined as

H(X) = -\int f(x) \ln f(x) \, dx. \quad (2)

Substituting the likelihood for f(x) and taking the limit as the number of observations n goes to infinity, the log-likelihood ℓ, averaged per observation, equals the negative conditional entropy and also the mutual information up to a constant:

\lim_{n \to \infty} \frac{\ell}{n} = -H(D|M) = I(M; D) - C = I_C, \quad (3)

where the natural logarithm gives units of nats. Dividing by ln(2) converts the result to bits. For finite n, the average ℓ gives an unbiased estimate of I_C unless the data used to estimate ℓ were also used to calibrate the model, which causes "overfitting" (Akaike, 1974).
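A quick numerical check of this convergence (our own illustration, with an assumed distribution): for samples drawn from a known normal, the average negative log-likelihood approaches the analytic entropy as n grows.

```python
# Sketch: the average negative log-likelihood converges to the entropy (Equation 3).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(scale=sigma, size=1_000_000)

avg_nll = -stats.norm.logpdf(x, scale=sigma).mean()       # nats per observation
H_analytic = 0.5 * np.log(2 * np.pi * np.e * sigma**2)    # differential entropy, nats

print(f"average -log-likelihood: {avg_nll:.4f} nats")
print(f"analytic entropy:        {H_analytic:.4f} nats")
print(f"in bits per observation: {avg_nll / np.log(2):.4f}")
```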
Two points are evident from Equation 3. (1) Minimizing the quantity in the second term, H(D|M), maximizes those in the first and third (look at the sign of each). (2) The concepts of conditional entropy, uncertainty, and (dis)similarity are interchangeable. Therefore, to maximize the information in the model, select the most likely objective function or whichever represents the error in the fewest bits.
There are several ways to explain this basic idea. Here is another: Recall that conditional entropy H(D|M) encodes the uncertainty. Implicit is that H itself is a model with certain assumptions A, denoted as H(D|M, A). A is commonly omitted to simplify the equations but is pertinent here because it represents our assumptions about the distribution of the error. Like any model, we can apply statistics to evaluate those assumptions, as is done in a variety of well-known methods, like generalized linear models, Bayesian model selection, and empirical likelihood.

Objectives to Log Likelihoods
Given a large dataset, we can estimate the conditional entropy H of different objectives from the maximum likelihood estimate of their log-likelihoods (Equation 3); as likelihoods, objectives are transformed to a common scale. The first, and arguably de facto, objective is MSE, which corresponds to the log-likelihood of the normal distribution (Figure 1):

\ell_2 = -\frac{n}{2} \ln(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,

where y_i are the observations, \hat{y}_i are the model predictions, and σ is the standard deviation of the error. The final term of the log-likelihood represents the squared error in the objective, and the remaining terms normalize the result (the subscript in ℓ_2 refers to the L2 norm, also known as the Euclidean norm). Another common objective is the mean absolute error (MAE), which corresponds to the log-likelihood of the Laplace distribution (Figure 1):

\ell_1 = -n \ln(2b) - \frac{1}{b} \sum_{i=1}^{n} |y_i - \hat{y}_i|,

where b is the mean absolute error (the subscript in ℓ_1 refers to the L1 norm).
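Plugging the maximum likelihood estimates of the scale parameters into the two log-likelihoods above reduces each to a simple closed form; here is a direct Python sketch (the function names are ours).

```python
import numpy as np

def ll_normal(y, y_hat):
    """Normal log-likelihood (the likelihood form of MSE)."""
    e = y - y_hat
    n = e.size
    sigma2 = np.mean(e**2)  # MLE of the error variance
    # With the MLE plugged in, the squared-error term collapses to n/2.
    return -0.5 * n * np.log(2 * np.pi * sigma2) - n / 2

def ll_laplace(y, y_hat):
    """Laplace log-likelihood (the likelihood form of MAE)."""
    e = np.abs(y - y_hat)
    n = e.size
    b = e.mean()            # MLE of the Laplace scale (the MAE itself)
    return -n * np.log(2 * b) - n
```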
Likelihoods for a variety of other objective functions are obtained by changing variables. For example, the mean squared log error (MSLE), which corresponds to the lognormal log-likelihood ℓ_3, is obtained from ℓ_2 by changing variables:

\ell_3 = \ell_2\left(v(y), v(\hat{y})\right) + \sum_{i=1}^{n} \ln \left| \frac{dv}{dy}(y_i) \right|,

where v, the natural log in this case, can be substituted with other functions to obtain log-likelihoods for normalized squared error (NSE; Nash & Sutcliffe, 1970), mean squared percent error (MSPE), as well as their Laplace equivalents (derivations and additional explanation are given in Appendix B).
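The change of variables translates directly into code: transform the observations and predictions, evaluate the base log-likelihood, then add the log of the Jacobian. A sketch, reusing ll_normal from the previous block (helper names are ours):

```python
import numpy as np

def ll_transformed(y, y_hat, ll_base, v, dv):
    """Log-likelihood after the change of variables v.

    ll_base : base log-likelihood, e.g. ll_normal for squared error
    v       : transformation, e.g. np.log
    dv      : derivative of v, e.g. lambda y: 1 / y
    """
    return ll_base(v(y), v(y_hat)) + np.sum(np.log(np.abs(dv(y))))

# MSLE (lognormal) example:
# ll_msle = ll_transformed(y, y_hat, ll_normal, np.log, lambda y: 1 / y)
```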
Likelihoods can also be combined as mixtures to represent different states. Our demonstration tests streamflow data that can have zero or negative values, for which ℓ_3 is undefined. The "zero flow" state is handled by mixing a binomial for the zeros with a lognormal for the non-zeros (Smith et al., 2010). Taking n_1 as the number of correctly predicted zero flows, and n_2 as the number of incorrect zero flows, the probability of correctly predicting the flow state among n_1 and n_2 is estimated as ρ = n_1/(n_1 + n_2). The corresponding binomial log-likelihood is

\ell_0 = n_1 \ln(\rho) + n_2 \ln(1 - \rho),

and the mixture is

\ell = \ell_0 + \ell_3,

where ℓ_0 is evaluated over the zeros (n_1 + n_2), and ℓ_3 is evaluated over the remaining observations (n_3). This type of binomial mixture is sometimes called a "zero-inflated" distribution for the way it inflates the probability of zeros.
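One reading of this mixture in Python (the zero-state bookkeeping is our interpretation, and we assume at least one zero-state observation with 0 < ρ < 1; ll_nonzero stands in for ℓ_3):

```python
import numpy as np

def ll_zero_inflated(y, y_hat, ll_nonzero, threshold=0.0028):
    obs_zero = y <= threshold
    pred_zero = y_hat <= threshold

    n1 = np.sum(obs_zero & pred_zero)   # correctly predicted zero flows
    n2 = np.sum(obs_zero ^ pred_zero)   # incorrectly predicted zero flows
    rho = n1 / (n1 + n2)                # assumes n1 + n2 > 0 and 0 < rho < 1

    ll0 = n1 * np.log(rho) + n2 * np.log(1 - rho)  # binomial term over the zeros

    keep = ~obs_zero & ~pred_zero       # the remaining n3 observations
    return ll0 + ll_nonzero(y[keep], y_hat[keep])
```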

Overfitting Bias
The convergence between conditional entropy and log-likelihood (Equation 3) only holds when evaluated "out of sample," meaning the data used to estimate the log-likelihood were not also used to calibrate the model, as opposed to "in sample." Otherwise, the log-likelihood is biased, a problem known as overfitting (Burnham & Anderson, 2002). Various approaches to correcting that bias give rise to well-known information criteria like AIC, BIC, and others (Akaike, 1974; Schwarz, 1978).
Information criteria are difficult or impossible to compute for certain classes of models, notably deep neural networks (Watanabe, 2010), so in practice, cross-validation is widely used to estimate the unbiased log-likelihood. But cross-validation is also costly: Data are either omitted or else multiple calibrations are performed on different subsets. When evaluating objectives, the risk of overfitting tends to be less severe, because the overfitting bias increases with the number of parameters and diminishes with the number of observations (refer to Appendix C). In our experiment, the model structure is fixed (the model always incurs the same overfitting penalty), so only the bias from the objective parameters affects their relative likelihoods. Objective functions typically have very few parameters (ℓ_2 has only σ), so they have less potential to overfit the data. Given a large dataset, the "in sample" log-likelihood may suffice for evaluating objectives; Appendix C provides some additional intuition about gauging the potential for overfitting. However, if the model structure is also being optimized during the experiment, then cross-validation may be necessary.
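When cross-validation is warranted, even a single holdout split gives an out-of-sample estimate. A minimal sketch (the split fraction and names are ours): fit the objective's scale parameter in sample, then score the log-likelihood on the held-out errors.

```python
import numpy as np
from scipy import stats

def holdout_ll_normal(y, y_hat, holdout=0.2, seed=0):
    """Out-of-sample normal log-likelihood, in nats per datum."""
    rng = np.random.default_rng(seed)
    test = rng.random(y.size) < holdout
    e = y - y_hat
    sigma = e[~test].std()   # fit the scale on the training errors only
    return stats.norm.logpdf(e[test], scale=sigma).mean()
```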

Weights
Given the conditional entropies H (out-of-sample log-likelihoods) for a set of m models, the "weight" of evidence for each model w_i is

w_i = \frac{x^{-H_i}}{\sum_{j=1}^{m} x^{-H_j}},

where the base x is 2 for bits or e for nats, and the set m may include different models, objectives, or combinations thereof. Variants of these weights occur in several texts; here we substitute −H for AIC (Burnham & Anderson, 2002; Hastie et al., 2009). The weighting renormalizes the entropies such that the weights sum to one and represent the probabilities of each model being true, assuming (1) the likelihood (observational evidence) overwhelms any prior information, and (2) the true model is among, or at least well approximated by, the candidate models.
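In code, the weights are a softmax over the negative entropies; shifting by the minimum avoids numerical underflow. A sketch (the function name is ours), using the per-datum entropies reported in Table 1 for illustration:

```python
import numpy as np

def weights(H, base=2.0):
    """Weights of evidence for conditional entropies H (bits if base=2)."""
    H = np.asarray(H, dtype=float)
    w = base ** (-(H - H.min()))  # shift by the minimum for numerical stability
    return w / w.sum()

print(weights([6.95, 11.20, 11.62]))  # ZMALE, NSE, MSE -> approx [0.92, 0.05, 0.04]
```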

Criticism
The previous assumptions are inappropriate for some problems, for which Bayesian methods, which consider full probability distributions rather than just their maxima, are better. But Bayesian methods are also more computationally complex and less beneficial for large datasets (because of the convergence in Equation 3).
As objective functions typically have relatively few parameters, maximum likelihood estimation is a pragmatic means to evaluate different objectives. As the following demonstration shows, a poorly chosen objective introduces noise, which causes information loss. For context, a typical cross-validation scheme omits 20 percent of the data during calibration, so you can expect to lose 20 percent of the available information. A poorly chosen objective can easily surpass that loss.

Benchmark Demonstration
To demonstrate, we compute the conditional entropies given by several objective functions with a simple streamflow model. The test data are daily streamflow observations from 1,385 streamgages in the conterminous United States (Russell et al., 2020); roughly 14 million observations in total. As streamflow can be zero or negative, which is undefined for some objective functions, flows below 0.0028 m³ s⁻¹ (0.01 ft³ s⁻¹) were thresholded and treated as the "zero-flow" state in the comparison. Different thresholds may yield slightly different results, particularly among logged objectives because of their sensitivity near zero.
We could use any model for this demonstration, so we chose a simple one: predict streamflow at a location by scaling the nearest concurrent observation by the ratio of the two drainage areas. So when predicting flow in a large river using observations from a smaller one, scale up the observations accordingly. By nature, the predictions are out of sample, so neither cross-validation nor bias adjustment is necessary. Besides being simple, it also represents the case in which the model is physically realistic, but its boundary conditions are uncertain (a common problem in Earth science). Another useful experiment might calibrate one or more models with several objectives, then evaluate each combination. Our experiment simulates uncertainty in the boundary conditions and measures how well different objectives represent that uncertainty, whereas the other tests different combinations of model and objective (uncertainty) for a particular problem.
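A sketch of that benchmark model (the function and variable names are ours): scale the nearest gage's concurrent observation by the drainage-area ratio.

```python
def predict_flow(q_neighbor, area_target, area_neighbor):
    """Predict streamflow by drainage-area-ratio scaling."""
    return q_neighbor * (area_target / area_neighbor)

# e.g., a neighbor gage draining 250 km^2 observes 3.2 m^3/s, and the
# target basin drains 1000 km^2:
print(predict_flow(3.2, 1000.0, 250.0))  # -> 12.8 m^3/s
```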
Table 1 gives the conditional entropies (Ĥ) from each objective, measured in average bits per datum (bit rate). The best objective represents the error in the fewest bits. For this particular benchmark, the zero-inflated mean absolute log error (ZMALE) performed best, with an entropy rate of 6.95 bits. For comparison, MSE was 11.62 bits, and NSE, the de facto standard in hydrologic modeling, was 11.20 bits. The magnitude of the entropies is unimportant here; only their differences matter. The lower conditional entropy, and therefore better performance, of objectives that log transform the data makes logical sense because hydrologic models typically make greater errors at greater streamflows; logging reframes the error in proportional terms rather than absolute ones. For that reason, some have argued for log-transformed objectives in hydrology, but the practice is still somewhat unorthodox (Clark et al., 2021). Rather than argue, such claims can be evaluated by comparing the conditional entropies (log-likelihoods) of the objectives.
In the experiment, the data and model were fixed, and so too were the errors. All that changed was how we measured the information in the error. Relative to ZMALE, the excess bits in the other objective functions are, therefore, noise. So, MSE measures at least 40 percent noise, and NSE at least 38 percent. In general, noisier objectives convey less information, and so require more iterations during calibration and more data to reach the solution, and they produce models that require more storage space (better model, better data compression). A well-known example of that point is stochastic gradient descent, where noise in the objective causes slower convergence (Bottou & Bousquet, 2007). In that case, each iteration completes faster, so the solution may be reached quicker overall. A poorly chosen objective incurs a similar penalty but potentially without benefit.
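The noise percentages above follow directly from the entropies reported in Table 1; here is the arithmetic, computed explicitly:

```python
H = {"ZMALE": 6.95, "NSE": 11.20, "MSE": 11.62}  # bits per datum (Table 1)
best = min(H.values())
for name, h in H.items():
    print(f"{name}: {100 * (h - best) / h:.0f}% excess bits (noise)")
# -> ZMALE: 0%, NSE: 38%, MSE: 40%
```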

Conclusions
Outside of Bayesian statistics, objective functions are rarely benchmarked despite their ubiquitous use as a basis for learning. We do not advocate for one over another (the choice varies from problem to problem), only that benchmarking objectives is a good practice that can yield better models, as well as better uncertainty estimates. Not every study needs to formally benchmark its objective. Simple rational arguments are also quite effective: log transformations tend to work well with lognormal data, but when in doubt, objectives are easily evaluated as likelihoods. The objective function measures a model's performance, as well as its uncertainty and compressibility, so it should be chosen with care.
Ultimately, how well machines-and scientists-learn and think depends on how well they measure uncertainty.
Appendix B

Using that notation, the log-likelihoods for the other objectives are obtained by the change of variables above. For the mean squared log error (MSLE),

\ell_3 = \ell_2\left(\ln y, \ln \hat{y}\right) + \sum_{i=1}^{n} \ln \frac{1}{y_i}. \quad (B1)

In each, the second term is the derivative of the transformation used in the change of variables; so in ℓ_3, 1/y is the derivative of ln(y). The derivative term could be simplified to −ln(y), but we left it to make the derivation clearer. Log-likelihoods for MALE and ZMALE are represented by substituting ℓ_1 for ℓ_2. Finally, the log-likelihood of the uniform error (U) is

\ell_8 = -n \ln\left(\max |y - \hat{y}|\right), \quad (B6)

which is an objective that minimizes the maximum error.
Note the distinction between NSE and MSE: In NSE, the errors are normalized by dividing them by the variance of the observed flow at that location, whereas in MSE, the errors are left in their original units. By that definition, their log-likelihoods will be equivalent when measured at a single location but will differ when measured over multiple locations with different variances.

Table 1. Estimated entropy Ĥ and weights of ten objective functions evaluated against the test data and model.
*Nash–Sutcliffe efficiency. **Undefined for zero flow but included for context.