A Probabilistic Interpretation of the Constant Gain Learning Algorithm

This paper proposes a novel interpretation of the constant gain learning algorithm through a probabilistic setting with Bayesian updating. The underlying process for the variable being estimated is not specified a priori through a parametric model; only its probabilistic structure is defined. Such a framework makes it possible to interpret the gain coefficient in the learning algorithm in terms of the probability of changes in the estimated variable. On the basis of this framework, I assess the range of values commonly used in the macroeconomic empirical literature in terms of the implied probabilities of changes in the estimated variables.

The gain coefficient in a recursive learning algorithm determines the weighting structure on past observations. In this regard, learning algorithms can be divided into two classes: decreasing gain (DG) and constant gain (CG) algorithms.
The most common instance of a DG algorithm is recursive least squares, where the gain coefficient is set equal to 1/t, with t the time period of the estimate, and all observations are thus weighted equally: this is suitable for estimating quantities that are believed to be constant over time. With a CG algorithm, instead, more recent observations receive a higher weight, and the weights decrease geometrically with time: this is usually employed when the estimated parameters are believed to change over time, as it allows for better tracking.
While the value of the gain coefficient in the DG case is obvious and has a clear interpretation, the interpretation of the gain in the CG case is much less straightforward, and there seems to be no agreement in the literature on an appropriate value for it. The aim of this paper is to provide a novel interpretation of the CG coefficient and thus help researchers who need to select a value for this parameter.
A growing literature in applied macroeconomics has used CG learning to explain a range of features, from the rise and fall of U.S. inflation in the 70s and 80s (in particular, the seminal works of Sargent (1999) and Sargent, Williams, & Zha (2006)) to the causes of business cycles (e.g., Milani (2011) and Eusepi & Preston (2011)). Though there is no direct evidence of the appropriate value for the gain parameter, Berardi and Galimberti (2017) provide a thorough discussion of the role and estimate bands for the gain parameter in macroeconomic applications. In general, higher gains imply faster reaction to changes, but more volatile estimates.
The CG algorithm is a reduced form learning model, which could be derived as an optimal solution of inference in a number of underlying frameworks. For example, Muth (1960) has shown how adaptive expectations can be optimal under certain assumptions about the structure of the variable being forecasted. A CG algorithm for estimating the (mean) value of a variable, in fact, implements adaptive expectations, and as such it provides optimal forecasts under conditions specified in Muth (1960). Those conditions are quite restrictive on the underlying process for the variable being forecasted, which must be representable as an infinite sum of current and past exogenous disturbances, with appropriate weights related to the gain parameter.
A CG algorithm can also be obtained through a Kalman filter model, which implements Bayesian updating in a state-space framework, with appropriate initial conditions. It is well known that with a time-invariant state-space model, the Kalman gain converges to a constant: choosing such constant as initial value for the gain, the Kalman filter gives rise to a CG algorithm. The natural interpretation of such gain coefficient is usually in terms of the variances of disturbances in the measurement and transition equations. An analysis of the differences between CG least squares and the Kalman filter is provided in Sargent and Williams (2005), who highlight the effect of different priors on the convergence properties of Bayesian learning. Similarly, Evans, Honkapohja, and Williams (2010) propose a constant gain generalized stochastic gradient algorithm, which can be viewed as an approximate Bayesian learning scheme when agents allow for parameter drift in their beliefs.
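The convergence of the Kalman gain to a constant can be illustrated with a scalar local-level model. Below is a minimal sketch: the noise variances q and r are purely illustrative choices, not values taken from the paper, and the recursion shown is the standard scalar Riccati iteration rather than any specific model used in the literature cited above.

```python
# Sketch: in a scalar local-level model
#   x_t = x_{t-1} + w_t  (state noise variance q)
#   y_t = x_t + v_t      (measurement noise variance r)
# the Kalman gain converges to a constant, so the filter reduces to a
# constant gain algorithm. q and r are illustrative, not from the paper.

def steady_state_kalman_gain(q, r, n_iter=1000):
    """Iterate the variance recursion until the gain settles."""
    P = 1.0  # initial filtered state variance (arbitrary starting point)
    K = 0.0
    for _ in range(n_iter):
        P_pred = P + q                 # predicted state variance
        K = P_pred / (P_pred + r)      # Kalman gain
        P = (1.0 - K) * P_pred         # updated (filtered) variance
    return K

gain = steady_state_kalman_gain(q=0.01, r=1.0)
```

Larger state noise relative to measurement noise produces a larger steady-state gain, matching the usual interpretation of the CG coefficient in terms of the two variances.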
I propose here instead an interpretation of the CG learning algorithm through a probabilistic setting where Bayesian learners estimate recursively the value of an unobservable variable through a signal. The underlying process for the variable being estimated is not specified a priori through a parametric model, and only its probabilistic structure is defined. This framework allows for a novel interpretation of the gain coefficient in terms of the probability of changes in the estimated variables. I then assess the values of various gain coefficients used in empirical studies against this background, deriving some implications on the implied frequency of changes in the estimated variables.
In general, Bayesian learning in a state-space model requires knowledge (or estimation) of the variance-covariance structure of state and measurement noise. Adaptive learning instead usually relies on much less information, though in the constant gain case it introduces a free parameter to be determined. Such determination can be arbitrary, and hence likely inefficient in terms of the implied prediction error, or based on some additional information about the dynamics of the system. My proposed interpretation of the gain parameter allows one to understand (or select) its value on the basis of the probability of changes in the estimated variables: such probability thus represents the additional information required to pin down an otherwise arbitrary parameter in an adaptive learning scheme.

CONSTANT GAIN ALGORITHM AND BAYESIAN LEARNING
Recursive learning algorithms can represent optimal learning behaviour under certain assumptions about the underlying quantities being learned and the observations available. The simplest example is the case where a constant needs to be estimated, and the series of observations includes i.i.d. noise with constant variance. In such a case, a decreasing gain algorithm with gain equal to 1/t is optimal, as it provides the least squares estimate of the sample mean. If the underlying variable to be estimated is instead time-varying, the literature suggests that, in general, a constant gain algorithm should be used, as it puts more weight on more recent observations and thus allows for better tracking.
More sophisticated approaches to the determination of the gain coefficient have also been proposed in the literature, endogenizing the gain to the volatility of the system. For example, Marcet and Nicolini (2003), in a context of hyperinflationary episodes, propose a learning algorithm where the gain coefficient switches from decreasing to constant (and vice versa) depending on the level of instability detected in the data: as instability increases, agents switch from a decreasing to a constant gain learning mechanism, in order to enhance tracking relative to smoothing. In the same hyperinflationary framework, Kostyshyna (2012) proposes an adaptive step-size algorithm which allows the gain to evolve in response to changes in the environment and results in an increasing gain during hyperinflationary episodes and a decreasing gain after the hyperinflation ends. In Milani (2014) the gain coefficient is also endogenous and is adjusted according to past forecast errors: when the average of past forecast errors is below a certain threshold, agents use a decreasing gain, while if the average is above the threshold, agents use instead a constant gain, for fear that the economy may be experiencing a structural break.
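A stylized sketch of this kind of error-based switching is given below. The window length and threshold are illustrative, and the rule is a simplified stand-in for the mechanisms in the papers above, not the exact rule of any of them.

```python
# Illustrative sketch of error-based gain switching: use a decreasing
# gain 1/t while recent forecast errors are small, and switch to a
# constant gain when they exceed a threshold (suggesting a break).
# window and threshold are hypothetical tuning values.

def switching_estimate(ys, gamma=0.1, window=5, threshold=1.0):
    est, t, errors = ys[0], 1, []
    for y in ys[1:]:
        t += 1
        err = y - est
        errors.append(abs(err))
        recent = errors[-window:]
        if sum(recent) / len(recent) > threshold:
            gain = gamma          # tracking mode: constant gain
        else:
            gain = 1.0 / t        # smoothing mode: decreasing gain
        est += gain * err
    return est
```

On a stable series the rule behaves like recursive least squares; after a large break it behaves like a CG algorithm and converges quickly to the new level.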

Constant gain algorithm
Suppose agents need to estimate the (time-varying) mean of a random variable y_t over time. Denoting x̃_t such estimate, the CG algorithm takes the form

x̃_t = x̃_{t−1} + γ(y_t − x̃_{t−1}),

where I have used the assumption x̃_1 = y_1, since no previous information is available to agents. The constant gain γ determines the weight put on past observations, as

w_{t,τ} = γ(1 − γ)^{t−τ}, for τ = 2, …, t,

where w_{t,τ} denotes the weight put at time t on the time τ ≤ t observation, and w_{t,1} = (1 − γ)^{t−1} on the initial observation. Since γ < 1, it can be seen that weights on past observations decrease exponentially with time.
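The equivalence between the recursion and the geometric weighting can be checked directly. A minimal sketch with illustrative data:

```python
# Check that the recursive CG estimate equals a weighted sum of the
# observations with geometrically decaying weights:
#   w_{t,tau} = gamma * (1 - gamma)**(t - tau),  tau = 2, ..., t
#   w_{t,1}   = (1 - gamma)**(t - 1)             (from the start x~_1 = y_1)

def cg_weights(t, gamma):
    """Weights put at time t on observations 1, ..., t."""
    return [(1 - gamma) ** (t - 1)] + \
           [gamma * (1 - gamma) ** (t - tau) for tau in range(2, t + 1)]

def cg_recursive(ys, gamma):
    """Constant gain recursion started at the first observation."""
    est = ys[0]
    for y in ys[1:]:
        est += gamma * (y - est)
    return est

ys = [0.3, 1.2, -0.7, 2.1, 0.5]   # illustrative signal values
gamma = 0.1
w = cg_weights(len(ys), gamma)
direct = sum(wi * yi for wi, yi in zip(w, ys))
```

The weights also sum to one, so the CG estimate is a proper weighted average of the observations.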
The same relationship between gains and weights also holds for a multivariate model where agents estimate a vector of time-varying coefficients through a linear regression model; see Berardi and Galimberti (2013).

A probabilistic Bayesian learning framework
Consider now a framework where agents are interested in estimating the value of an unobservable variable x_t, t ≥ 1. Nature draws the value at time t = 0 from an improper uniform distribution over ℝ. Consistently, agents have a flat (uninformative) prior on its value at time t = 1. Nature can also re-draw, with some fixed and known probability 0 ≤ p ≤ 1, a new value for the variable, again from an improper distribution over ℝ, at the beginning of each period t > 1. At every period t ≥ 1 agents receive a signal on the value of x_t, in the form

y_t = x_t + ε_t,

where ε_t is an i.i.d. random variable, normally distributed with zero mean and constant variance σ².
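A sketch of this signal-generating process is given below. Since an improper uniform over ℝ cannot be sampled, a wide bounded uniform stands in for Nature's draws, purely for illustration; p and sigma are illustrative values.

```python
# Sketch of the signal process: Nature re-draws x with probability p at
# the start of each period; agents observe y_t = x_t + eps_t.
# The improper uniform prior is approximated by a wide uniform draw.
import random

def simulate(T, p, sigma, seed=0):
    rng = random.Random(seed)
    x = rng.uniform(-100, 100)          # initial draw by Nature
    xs, ys = [], []
    for t in range(T):
        if t > 0 and rng.random() < p:
            x = rng.uniform(-100, 100)  # re-draw with probability p
        xs.append(x)
        ys.append(x + rng.gauss(0.0, sigma))
    return xs, ys

xs, ys = simulate(T=200, p=0.025, sigma=1.0)
```

With p = 0.025 the variable changes on average once every 40 periods, which is the kind of quantity the framework ties to the gain coefficient below.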
I first define the truncated sample means ȳ_{t,τ} = (1/(t − τ + 1)) Σ_{s=τ}^{t} y_s and the coefficients c_{t,τ} = p(1 − p)^{t−τ} for τ = 2, …, t, with c_{t,1} = (1 − p)^{t−1}. The coefficients c_{t,τ} capture the probability that each (truncated) series ȳ_{t,τ} is the appropriate one for computing the conditional expected value of x_t, that is, the probability that Nature re-drew x at the beginning of time τ and never after. Clearly, (7)-(8) are the same as (3)-(4) for p = γ.
Substituting (7)-(8) into (6) and collecting terms for each y_τ, it is then possible to rewrite the posterior mean x̃_t as a weighted sum of current and past values of y as

x̃_t = Σ_{τ=1}^{t} h_{t,τ} y_τ, where h_{t,τ} = Σ_{s=1}^{τ} c_{t,s} / (t − s + 1).

It can be shown that Σ_{τ=1}^{t} h_{t,τ} = 1. Clearly, if p = 1 (x changes for sure every period), h_{t,τ} = 0 for τ < t and h_{t,t} = 1: only the last observation matters. If instead p = 0 (x constant), then all observations receive the same weight 1/t. This last case gives rise to a decreasing gain algorithm, implementing recursive least squares (equivalent to stochastic gradient in this case), which is simply the sample mean.
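These weights can be computed directly. The sketch below assumes the reconstructed form above, in which each observation y_τ receives weight c_{t,s}/(t − s + 1) from every truncated mean with s ≤ τ:

```python
# Sketch of the Bayesian weights, assuming
#   c_{t,1} = (1 - p)**(t - 1),  c_{t,s} = p * (1 - p)**(t - s) for s >= 2,
#   h_{t,tau} = sum_{s=1}^{tau} c_{t,s} / (t - s + 1).

def c_coeffs(t, p):
    """Probability that Nature last re-drew x at time s, s = 1, ..., t."""
    return [(1 - p) ** (t - 1)] + \
           [p * (1 - p) ** (t - s) for s in range(2, t + 1)]

def h_weights(t, p):
    """Weight put at time t on each observation y_1, ..., y_t."""
    c = c_coeffs(t, p)
    h, acc = [], 0.0
    for tau in range(1, t + 1):
        acc += c[tau - 1] / (t - tau + 1)   # term contributed by s = tau
        h.append(acc)
    return h
```

The weights sum to one for any p, and the two limiting cases behave as described: p = 0 gives equal weights 1/t, while p = 1 puts all weight on the last observation.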
To better understand the weighting structure defined by (10)-(11), I propose Figure 1 (Weighting structure of signals). Observation 1 is relevant for inference about the current value of x only if Nature never re-drew over the whole sample period from 1 to t, which happened with probability (1 − p)^{t−1}: in such case each observation in that sample should be weighted equally, with weight 1/t. Observation 2 is relevant if Nature never re-drew (again, with weight 1/t), which happened with probability (1 − p)^{t−1}, or if it re-drew at the beginning of period 2 and never after (in this case, with weight 1/(t − 1)), which happened with probability p(1 − p)^{t−2}. And so on.

Flat prior and independent draws
In the proposed framework, Nature can re-draw the variable of interest at every period from an improper distribution over ℝ, with some fixed probability 0 ≤ p ≤ 1. This means that Nature can draw any value for x, from an unbounded set. This assumption justifies the flat prior on the new draw, and allows for the weighting structure proposed above. Having a model-free evolution of the variable of interest is crucial for the proposed framework, and contrasts it with alternative settings where the estimated variable changes according to a known dynamic model. For example, if agents knew that x was to follow an AR(1) process with known autoregressive coefficient (say, a random walk), with Nature drawing each period the i.i.d. innovation, then a Kalman filter would be the optimal algorithm to employ for tracking and estimating the value of x. It is the model-free assumption that allows our new interpretation of the weights in terms only of p (and time) and not of other elements of the assumed dynamic model for x.
From an applied perspective, this modelling choice can be justified in particular when there is no a priori knowledge of what the changes in could look like, and when there is no reason to believe that the current value of the estimated variable should give any insight into the possible new value. These are of course extreme assumptions, but provide a sharp characterization of the weighting structure on past observations and allow for its interpretation purely in terms of probabilities.

A COMPARISON
In light of the proposed framework, it is instructive to analyse the relationship between the adaptive learning gain γ and the probability p in the proposed Bayesian learning model. The gain parameter in an adaptive learning algorithm determines the weight put on past observations: with a decreasing gain 1/t, all observations receive equal weight; with a constant gain γ, instead, the weight decays exponentially with past observations. A similar interpretation can be given to p, which represents the probability of a change in the variable x happening at each time t: this determines the probability that each observation from time τ, 0 < τ ≤ t, is relevant for time-t inference, which, together with the number of observations, determines individual weights. The weighting structure represented by (10)-(11) cannot be generated exactly by a CG algorithm for finite t. Nevertheless, even if only an approximation, it provides a means to interpret the weighting implied by such an algorithm. Clearly, if one sets γ = p then w_{t,τ} = c_{t,τ}: if the constant gain is to be interpreted as the probability p, the weights put on individual observations through the CG algorithm are the weights put on past truncated series of observations in the probabilistic Bayesian setting. In that setting, weights on individual observations are instead given by (10)-(11), which, in a non-recursive way, can be rewritten as

h_{t,τ} = Σ_{s=1}^{τ} c_{t,s} / (t − s + 1).

While the weighting structure on individual observations in the Bayesian framework is more convoluted than that in the CG algorithm, both w_{t,τ} and the leading term in h_{t,τ} (represented by c_{t,τ}) decay exponentially, leading to similar weight profiles on older observations. In fact, for γ = p, the leading term in h_{t,τ} is equal to w_{t,τ}/(t − τ + 1). Figure 2 shows w_{t,τ} and h_{t,τ}, computed for γ = p = 0.025 with t = 100. Figure 3 then shows the same series, but for t = 1,000 (the second and third frames zoom in, respectively, on the first and last 100 points of the series).
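The two weighting structures can be computed side by side. The sketch below uses the geometric CG weights and the reconstructed Bayesian weights with γ = p = 0.025 and t = 100, echoing the comparison in Figures 2-4:

```python
# Sketch comparing CG weights w_{t,tau} = gamma*(1-gamma)**(t-tau)
# (with w_{t,1} = (1-gamma)**(t-1)) against the Bayesian weights
# h_{t,tau}, for gamma = p.

def w_weights(t, gamma):
    return [(1 - gamma) ** (t - 1)] + \
           [gamma * (1 - gamma) ** (t - tau) for tau in range(2, t + 1)]

def h_weights(t, p):
    c = [(1 - p) ** (t - 1)] + \
        [p * (1 - p) ** (t - s) for s in range(2, t + 1)]
    h, acc = [], 0.0
    for tau in range(1, t + 1):
        acc += c[tau - 1] / (t - tau + 1)
        h.append(acc)
    return h

t, g = 100, 0.025
w, h = w_weights(t, g), h_weights(t, g)
delta = [wi - hi for wi, hi in zip(w, h)]   # the difference in Figure 4
```

Both vectors are proper weighting schemes (positive, summing to one), and plotting them reproduces the similar exponentially decaying profiles discussed in the text.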
It can be seen that as t increases, w_{t,τ} and h_{t,τ} get closer to each other for small values of τ, while for high values of τ (that is, for observations closer to the time of estimation) the difference between the two terms remains largely the same. Weights w_{t,τ}, τ ≥ 1, are independent of t (they depend instead on t − τ; that is, w_{t,τ} = w_{t+k,τ+k}). The same is not exactly true for h_{t,τ}, though quantitatively it is indeed the case that h_{t,τ} ≃ h_{t+k,τ+k}. This is due to the fact that the leading τ − 1 terms out of the total τ + k terms in h_{t+k,τ+k} are the same as the leading τ − 1 terms out of the total τ terms in h_{t,τ} (the τth terms differ slightly), with the additional terms in h_{t+k,τ+k} negligible in size. Thus the final end of the w and h curves tends to remain at the same distance as t increases, while the two curves get closer and closer to each other on their initial part. It can be seen that, despite being derived in different frameworks, the shapes of the two weighting structures are remarkably similar, leading to similar weighting on past information in the two cases. Figure 4 shows the difference Δ = w − h (where w and h are the vectors {w_{t,τ}}_{τ=1}^{t} and {h_{t,τ}}_{τ=1}^{t}) for t = 1,000 and γ = p = 0.025.

CONSTANT GAINS IN THE EMPIRICAL LITERATURE
Using the framework developed above, one can interpret the constant gain coefficients that have been found to fit the data well in empirical macroeconomic studies in terms of the implied probability of changes in the estimated variables. Typical values proposed (estimated or calibrated) in the empirical literature for the constant gain range from close to zero to over 0.2, with most falling between 0.01 and 0.1; see Berardi and Galimberti (2017). A larger value of about 0.27 has been documented, for example, by Benhabib and Dave (2014) in the context of an asset pricing model, while smaller values have been reported by Markiewicz and Pick (2014) using data from professional forecasters, with values as small as 0.001 documented, depending on the specific data and model specification used. In the context of age-dependent gains, Malmendier and Nagel (2016) find gains as high as 0.8 in the early stages of life, when little data is available, decreasing then exponentially to zero as one gets older: this age-dependent structure, though, maps into an average gain of around 0.018, in line with typical values in the macroeconomic literature. Irrespective of the specific value selected for the gain, one can use the framework proposed here to compute the implied probability of changes in the estimated variables that corresponds to such a coefficient, by finding the p that, for a given t, minimizes the sum of squared deviations between the weighting structure implied by the gain and the weighting structure of the Bayesian framework. That is, one can compute

p̂_t(γ) = arg min_p Δ_t(γ)′ Δ_t(γ),

where the notation for p̂_t(γ) and Δ_t(γ) makes explicit the dependence on both t and γ. Fixing γ, one can find the implied probability for a certain gain coefficient as a function of the number of observations. Figure 5 shows such measure for γ = 0.025. It can be seen that for large enough values of t, p̂_t(γ = 0.025) stabilizes and becomes constant.
One can thus compute the value of p̂_t(γ) for large t, obtaining a function that gives the implied (asymptotic) probability p̂ for any value of γ. In particular, I restrict the range of γ between 0.01 and 0.1, which contains most values used in the empirical literature: Figure 6 shows the results. It can be seen that, for gains between 0.01 and 0.1, the implied probability of changes in the estimated variable(s) each period ranges from 0.31 per cent (γ = 0.01) to 3.59 per cent (γ = 0.1).
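The minimization can be sketched with a crude grid search over p. The sketch below uses the geometric CG weights and the reconstructed Bayesian weights described in the text; the sample length t and the grid resolution are kept small for speed and are illustrative, so the resulting numbers only approximate the asymptotic mapping reported in the paper.

```python
# Sketch: for a given gamma and t, grid-search the p minimizing the
# squared distance Delta'Delta between CG and Bayesian weights.
# t = 100 and the grid resolution are illustrative choices.

def w_weights(t, gamma):
    return [(1 - gamma) ** (t - 1)] + \
           [gamma * (1 - gamma) ** (t - tau) for tau in range(2, t + 1)]

def h_weights(t, p):
    c = [(1 - p) ** (t - 1)] + \
        [p * (1 - p) ** (t - s) for s in range(2, t + 1)]
    h, acc = [], 0.0
    for tau in range(1, t + 1):
        acc += c[tau - 1] / (t - tau + 1)
        h.append(acc)
    return h

def implied_p(gamma, t, grid=1000):
    """Return the grid point p minimizing sum((w - h)**2)."""
    w = w_weights(t, gamma)
    best_p, best = 0.0, float("inf")
    for i in range(1, grid):
        p = i / grid
        val = sum((wi - hi) ** 2 for wi, hi in zip(w, h_weights(t, p)))
        if val < best:
            best_p, best = p, val
    return best_p
```

As in Figure 6, larger gains map into larger implied probabilities, with the implied p noticeably smaller than the gain itself.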

CONCLUSIONS
This paper has proposed a probabilistic Bayesian framework that allows for a novel interpretation of the weighting structure on past observations implied by a CG learning algorithm. By assuming a model-free representation of the evolution of the variable being estimated, one does not need to specify the variance-covariance structure of state and measurement noise in order to derive the optimal Bayesian weights on past information, and the gain coefficient can, under this interpretation, be linked only to the probability of changes in the estimated variables. Using this framework, it is possible to map the gain coefficients used in empirical studies into implied probabilities. In particular, most works in the macroeconomic empirical literature use a gain coefficient between 0.01 and 0.1, which maps into per-period probabilities of changes in the estimated variable between 0.31 per cent and 3.59 per cent.
Such new understanding of the role played by the gain parameter in a learning algorithm could be useful for researchers in choosing an appropriate value in empirical studies.