An Introduction to Predictive Processing Models of Perception and Decision-Making

The predictive processing framework includes a broad set of ideas, which might be articulated and developed in a variety of ways, concerning how the brain may leverage predictive models when implementing perception, cognition, decision-making, and motor control. This article provides an up-to-date introduction to the two most influential theories within this framework: predictive coding and active inference. The first half of the paper (Sections 2–5) reviews the evolution of predictive coding, from early ideas about efficient coding in the visual system to a more general model encompassing perception, cognition, and motor control. The theory is characterized in terms of the claims it makes at Marr's computational, algorithmic, and implementation levels of description, and the conceptual and mathematical connections between predictive coding, Bayesian inference, and variational free energy (a quantity jointly evaluating model accuracy and complexity) are explored. The second half of the paper (Sections 6–8) turns to recent theories of active inference. Like predictive coding, active inference models assume that perceptual and learning processes minimize variational free energy as a means of approximating Bayesian inference in a biologically plausible manner. However, these models focus primarily on planning and decision-making processes that predictive coding models were not developed to address. Under active inference, an agent evaluates potential plans (action sequences) based on their expected free energy (a quantity that combines anticipated reward and information gain). The agent is assumed to represent the world


Introduction
This paper traces predictive processing (PP) from its early, general formulations to its more recent application as a theory of decision-making in the active inference framework. The first half of the paper (Sections 2-5) focuses on early formulations of PP and explains its appeal as a general theory of cognition and behavior; predictive coding is introduced as a prominent PP model of perception. The second half of the paper (Sections 6-8) focuses on more recent developments, outlining how active inference provides a powerful model of decision-making.
As should already be evident from this brief outline, what "predictive processing" means has varied over time. Even now, the framework is hard to pin down or summarize in a comprehensive and fully precise way. Different aspects of the framework tend to be articulated in different ways by different authors, and different features have been accorded greater or lesser significance in different contexts. Key features of the framework are also frequently left open, with the assumption that the relevant content will be filled in by later work. Indeed, critics have accused PP of being underspecified as a model of cognition (Cao, 2020; Litwin & Miłkowski, 2020). In our view, PP should be understood, not as a fully articulated model of cognition, but as a broad research program in computational neuroscience with a set of evolving commitments that are only gradually being uncovered and agreed upon. It would be a mistake to try to state now what PP, as an umbrella framework under which many distinct models could fall, is and is not committed to. With this qualification in mind, the present paper will review what a plausible variant of this framework is likely to say, with the caveat that this may be changed, further nuanced, or recast in the future. To be considered plausible, however, we assume any account will remain at least largely consistent with the most general proposed definitions of PP, such as those put forward by Clark (2013b, 2016), which view the brain as a multilevel prediction engine that makes use of some form of hierarchy.
Early versions of PP aimed to provide a computational framework that was applicable to many, perhaps to every, domain of cognition and behavior. They also aimed to span multiple levels of inquiry in the cognitive sciences. This can be captured in terms of Marr's computational, algorithmic, and implementation levels of description (Marr, 1982). At the computational level, PP suggests that the problem the brain faces in cognition is to minimize sensory prediction error (although the nature of the predictions in question can differ by application). At the algorithmic level, it suggests that the method by which the brain attempts to solve this problem consists in the iterative dynamics of a hierarchical network of simple prediction and error units. At the implementation level, PP suggests that this algorithm is primarily implemented in the neocortex, with distinct brain regions, and distinct cell populations inside each brain region, coding these predictions and prediction errors for various hierarchically linked content domains. That said, various hypotheses about the role of subcortical regions have also been proposed, especially with respect to how confidence in predictions is encoded. These claims, pertaining to the computational, algorithmic, and implementation levels, are described in Sections 2-3, 4, and 5, respectively. Here, it is worth noting that our description more or less corresponds to a variant of the more specific algorithm of predictive coding, which has been especially prominent in the PP literature and has had previous applications in machine learning. Thus, other algorithmic- and implementation-level hypotheses and extensions are possible (e.g., see Bastos et al., 2012; Shipp, 2016).
In more recent years, ideas from PP have been developed into a precise model of decision-making, known as the active inference framework. Somewhat confusingly, the term "active inference" was used in earlier articulations of PP to refer to a model of motor control based around the minimization of proprioceptive prediction errors. However, in current parlance, "active inference" primarily refers to a model of planning and decision-making under uncertainty; namely, a model of how a cognitive agent with imperfect sensors should combine belief-like states and goal-like states to decide upon sequences of action over time (Smith et al., 2022b). This is most akin to modern reinforcement learning models or hierarchical Gaussian filter models of decision-making in computational neuroscience (Mathys et al., 2014). However, in addition to goal/reward seeking, active inference also places a strong emphasis on Bayesian inference and on decisions that seek out informative observations to reduce uncertainty about how to act.
The active inference framework has points of commonality with earlier versions of PP: it shares the algorithmic goal of minimizing a quantity similar to prediction error (variational free energy; discussed further below), it shares a commitment to probabilistic representation and approximate Bayesian inference over a hierarchical probabilistic generative model, and it extends prediction error minimization to anticipated future observations (i.e., minimizing expected deviations between preferred and predicted observations, and trying to identify beliefs that best predict future observations). However, it departs from earlier formulations by modeling inference within partially observable Markov decision processes (POMDPs) with discrete time and discrete states. Section 6 describes the connection between active inference and earlier versions of PP. Section 7 defines the POMDP model. Section 8 describes some applications of the model, including constructing simulations and performing inference on empirical data.

The computational level
PP proposes that the computational problem faced by the brain across many, and possibly all, aspects of cognition is to minimize sensory prediction error. As we will see, this task has important connections to two further problems: minimizing variational free energy and performing Bayesian inference.
Dating back at least to the work of Attneave and Barlow, it has been proposed that the sensory coding scheme used by the brain is "efficient" (Attneave, 1954; Barlow, 1961). The relevant form of efficiency in this context refers to the physical resources used by the brain, such as the metabolic cost of spike generation, wiring costs, and so on. A key insight, however, was that efficiency can also be quantified formally in information-theoretic terms; specifically, as the number of bits the brain would need to encode sensory information. The efficient coding hypothesis suggests that the brain aims to minimize this number of necessary bits (Simoncelli & Olshausen, 2001; Srinivasan, Laughlin, & Dubs, 1982; Sterling & Laughlin, 2015).
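The efficiency gain from prediction can be illustrated with a toy calculation (a sketch of the general idea only; the signal, noise level, and bin count below are arbitrary illustrative choices, not values from the literature):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_bits(samples, bins):
    """Empirical Shannon entropy (bits) of a discretized signal."""
    counts, _ = np.histogram(samples, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# A smooth, highly predictable signal plus a little sensor noise.
t = np.arange(10_000)
signal = np.sin(2 * np.pi * t / 200) + 0.05 * rng.standard_normal(t.size)

# Predict each sample from the previous one; encode only the residual error.
residual = signal[1:] - signal[:-1]

bins = np.linspace(-1.5, 1.5, 64)
h_raw = entropy_bits(signal[1:], bins)
h_res = entropy_bits(residual, bins)
```

Because successive samples are highly correlated, the residuals cluster tightly around zero and have a much lower empirical entropy than the raw samples; this is the sense in which transmitting prediction errors is cheaper than transmitting the signal itself.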
Rao and Ballard proposed that this kind of efficient coding characterizes not only the first stages of visual processing (e.g., activity in the retina, lateral geniculate nucleus, and V1), but also later stages deeper in the cortex (e.g., activity in V2, V4, MT, and MST; Rao & Ballard, 1999). They claimed that the functional anatomy of the visual cortex could be interpreted as a hierarchical predictive model whose task was to minimize prediction errors concerning sensory signals. The more accurate this hierarchical model's predictions, the fewer bits would be needed to encode signals from the sensory boundary. In other words, all that would need to be sent inwards would be an error signal with respect to predictions. On this kind of "predictive coding" model of perception, the computational problem the brain faces during vision could be conceived, not as constructing a three-dimensional representation of the world in a bottom-up fashion from incoming sensory data, but as using an internal, hierarchically structured model to predict incoming sensory data, only revising its representation of the world when necessary to reduce prediction errors (Huang & Rao, 2011; Mumford, 1992; Spratling, 2017). Building on this initial work, Friston proposed a more general theory of cortical function in terms of predictive coding that spanned all sensory modalities and incorporated higher-level integration processes (Friston, 2005). He also highlighted a way in which predictive coding could be understood to minimize a broader information-theoretic quantity called "variational free energy" (discussed in the next section), which corresponds to a combination of the surprisingness of a set of observations under a model and how much beliefs are adjusted to reduce that surprise. This quantity had previously been employed by others in the context of machine learning and statistical model optimization (Bishop, 2006; Hinton & Zemel, 1994; MacKay, 2003; Winn & Bishop, 2005). Building on this previous work, Friston suggested that the brain may also optimize its own generative model of the world in a similar manner, striving to maximize the accuracy of its predictions while also changing beliefs as little as possible (i.e., arriving at the most parsimonious interpretation of new, unexpected data).
Utilizing prior work on optimal control theory, Friston and colleagues observed that motor control could also be described in terms of prediction error (and variational free energy) minimization (Adams, Shipp, & Friston, 2013; Friston, 2010; Friston, Daunizeau, Kilner, & Kiebel, 2010; Shipp, Adams, & Friston, 2013; Todorov, 2009; Todorov & Jordan, 2002). Rather than the brain's internal model of body position (proprioception) always being revised to better predict sensory input over time, motor control could be implemented by temporarily holding predictions about limb position constant (corresponding to target body states) and minimizing error by adjusting body position to match those predictions. Perception and motor control could thus both be viewed as instances of the single task of sensory prediction error minimization. A weighting mechanism would simply be required to dynamically adjust when prior beliefs about body position were or were not held fixed. Over a series of papers, Friston developed this idea further, arguing that broad swathes of action, perception, value-based decision-making, and learning could be characterized in similar terms (Friston, 2003; Friston, 2005; Friston, 2009; Friston, 2010).
However, in order to make this claim plausible, "minimizing prediction error" has to be understood in a broader and more flexible manner than was commonly portrayed in early work on efficient coding in low-level sensory systems. In this context, two features in particular are worth highlighting.
First, the problem facing the brain should be understood, not as minimization of current prediction error or environmentally typical prediction error, but as minimization of a long-term measure of prediction error specific to the individual agent in question. This quantity is typically formalized as a real number that reflects some aggregate of expected prediction errors for different sensory channels of that agent over time, and it is assumed to vary in a continuous manner. How to operationalize this abstract quantity and apply it in a real-world setting, where one might be dealing with diverse types of sensory channel, categorical judgments, and discrete events, is not obvious. The relevant time period is also unclear; that is, it is not obvious whether the objective should be to minimize prediction error across seconds, minutes, years, the remaining life of the organism, or evolutionary time. The interpretation of the probabilities that appear in the task description is also left somewhat open: Is the problem to minimize the brain's objective chance of making a future sensory prediction error, or its own subjective estimate of doing so? Both are closely associated with PP, but they are different problems (Colombo & Wright, 2021; Sprevak, 2020).
Second, not every prediction error is given equal weight. The brain's goal is to minimize precision-weighted sensory prediction error. "Precision weighting" refers to a further set of parameters that determine the system's responsiveness to different prediction errors within the optimization task. Formally, precision corresponds to the (expected) inverse variance of a prediction error, but it can also be understood more loosely in terms of the expected reliability or usefulness of a given signal in a particular context. Precision weighting is claimed to be connected to psychological features, such as attention, salience, value, and uncertainty (Brown, Adams, Parees, Edwards, & Friston, 2013; Clark, 2013a; Feldman & Friston, 2010; Friston, 2003; Friston, 2009; Friston, Mattout, & Kilner, 2011). Within formal models, precision weightings can be learned based on observed levels of variance in a signal. However, what this means in a real-world setting is not always clear. In particular, it is often unclear which physical, measurable factors constrain assignment of a particular matrix of precision weightings to an agent's model. This has led to the accusation that precision weighting can function as a "magic modulator" for PP, allowing almost any behavior to be modeled as prediction error minimization, provided one assigns the agent a suitable set of precision weights (Miller & Clark, 2018).
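The effect of precision on belief updating can be sketched with a standard Gaussian (Kalman-style) update. This is a generic illustration of the inverse-variance idea, not a model from the PP literature, and all numbers are arbitrary:

```python
def gaussian_update(prior_mean, prior_var, obs, obs_var):
    """Bayes-optimal belief update for a Gaussian prior and Gaussian likelihood.

    The prediction error (obs - prior_mean) is weighted by the relative
    precision (inverse variance) of the observation.
    """
    gain = prior_var / (prior_var + obs_var)  # precision-derived weighting
    post_mean = prior_mean + gain * (obs - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var

# The same prediction error under high vs. low assigned precision.
m_high, _ = gaussian_update(prior_mean=0.0, prior_var=1.0, obs=2.0, obs_var=0.1)
m_low, _ = gaussian_update(prior_mean=0.0, prior_var=1.0, obs=2.0, obs_var=10.0)
```

An identical prediction error of 2.0 moves the belief far more when the signal is treated as precise (low `obs_var`) than when it is treated as noisy, which is the formal sense in which precision weights gate the influence of errors.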

The free-energy formulation and Bayesian inference
Minimizing sensory prediction error has connections both to minimizing variational free energy and to performing approximate Bayesian inference.
As mentioned above, variational free energy is an information-theoretic quantity (i.e., not the thermodynamic free energy of physics) that applies to an agent's probabilistic predictive model. A predictive model encodes an agent's beliefs about unobservable states of the world (s). Variational free energy is formally defined in terms of the expected loss of information about s when using the agent's predictive model, q, in place of that of a perfect Bayesian reasoner, the true posterior p(s | o). The amount of information lost by the agent using q instead of the true posterior is quantified by the Kullback-Leibler (KL) divergence:

KL[q ‖ p] = ∫ q(s) [log q(s) − log p(s | o)] ds

A KL divergence measures the dissimilarity (degree of nonoverlap) between two probability distributions, in this case between q and the true posterior. An agent that aims to approximate a Bayesian reasoner will try to choose a probabilistic model, q, that minimizes this KL divergence. If one plugs Bayes' rule into the equation above, that is, one expands the true posterior as the prior probability distribution conditioned on an observation, o, so that p(s | o) = p(s, o)/p(o), then the KL divergence can be rewritten as:

KL[q ‖ p] = ∫ q(s) [log q(s) − log p(s, o)] ds − (−log p(o))

The first term on the right-hand side is the "variational free energy." The second term is the "surprisal" of the new data. Note that the surprisal term, −log p(o), does not depend on the choice of q. So, an agent that aims to approximate Bayesian inference, that is, an agent that aims to choose a q that minimizes the KL divergence with the true Bayesian posterior, can always be described as minimizing variational free energy.
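The relationship described here, that the KL divergence from the true posterior equals variational free energy minus surprisal, can be checked numerically on a small discrete model (the joint distribution and the approximate posterior q below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# A discrete toy model: joint distribution p(s, o) over 4 hidden states
# and 2 possible observations.
p_joint = rng.random((4, 2))
p_joint /= p_joint.sum()

o = 0                              # the observation actually received
p_o = p_joint[:, o].sum()          # model evidence, p(o)
posterior = p_joint[:, o] / p_o    # exact Bayesian posterior, p(s | o)

q = np.array([0.4, 0.3, 0.2, 0.1])  # an arbitrary approximate posterior

kl = float(np.sum(q * (np.log(q) - np.log(posterior))))
free_energy = float(np.sum(q * (np.log(q) - np.log(p_joint[:, o]))))
surprisal = float(-np.log(p_o))
```

Since the surprisal term is fixed by the data, lowering `free_energy` over choices of `q` is the same as lowering `kl`, which is why free energy can serve as a tractable optimization target when the exact posterior is unavailable.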
This tight connection between variational free energy and Bayesian inference is exploited by variational methods used for optimization in machine learning. These algorithms search explicitly for a q from some family of distributions Q that minimizes variational free energy (or, equivalently, that maximizes the evidence lower bound (ELBO), which is defined as the negative of variational free energy; Bishop, 2006; Blei, Kucukelbir, & McAuliffe, 2017; MacKay, 2003). Sampling-based methods for approximate Bayesian inference, such as Gibbs sampling, Markov chain Monte Carlo simulation, and particle filtering, rarely mention free energy (Bishop, 2006; MacKay, 2003). However, the same considerations apply. That is, to the extent that a sampling method is successful at approximating Bayesian inference, the empirical posterior distributions it arrives at will also minimize variational free energy (Aitchison & Lengyel, 2017; Gershman, 2019).
Unlike the relationship between variational free energy and Bayesian inference, there is no simple logical equivalence between variational free energy and prediction error. The envisioned relationship tends to have a more qualified form, such as: if an agent were to adopt a certain kind of probabilistic model and if it were to attempt to minimize free energy in certain ways, then it would also minimize prediction error (and vice versa). Identifying which assumptions most plausibly connect free energy minimization and prediction error minimization within the brain remains an open issue. One popular set of assumptions has been defended by Friston and colleagues (Daunizeau, 2018; Friston, 2003; Friston, 2005; Friston, 2008; Friston & Kiebel, 2009; Friston, Mattout, Trujillo-Barreto, Ashburner, & Penny, 2006; Gershman, 2019). These include: a mean-field approximation, the claim that the agent's probabilistic model, q, can be factorized into a product of independent probability distributions, q(s) = ∏_i q_i(s_i); the assumption that the q_i form a hierarchically structured probabilistic model; the assumption that each individual q_i is Gaussian; the Laplace approximation, under which variational free energy can be approximated using a second-order Taylor expansion around the mode of the posterior; and the assumption that the brain attempts to minimize variational free energy using gradient descent.
It is important to stress, however, that such a derivation provides no reason to think that free-energy minimization equates to prediction error minimization in some general or unqualified sense. Rather, the claim is that, in light of these assumptions, minimizing prediction error can be treated as a good substitute or proxy for the problem of minimizing variational free energy. A brain wired to minimize prediction error would, given these assumptions, tend to minimize variational free energy and hence implement approximate Bayesian inference. One more general way to see this is through a commonly presented decomposition of variational free energy into two terms: (1) the KL divergence between prior and posterior beliefs over states; and (2) the expected log-probability of observations under the model (subtracted from the first term). Maximizing the latter term relates to prediction error minimization. However, variational free energy can also be reduced by minimizing the first term, which constrains beliefs from changing more than necessary; this plays the role of a complexity cost enforcing parsimonious beliefs. Thus, while minimizing prediction error will tend to reduce variational free energy (all else being equal, by increasing predictive accuracy), free energy can also be reduced in other ways (i.e., by reducing the complexity of one's model).
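The accuracy/complexity decomposition described above can likewise be verified numerically for a small discrete model (again with arbitrary illustrative numbers): the direct definition of variational free energy equals the KL divergence from the prior (complexity) minus the expected log-probability of the observation (accuracy).

```python
import numpy as np

rng = np.random.default_rng(2)

prior = rng.random(4)
prior /= prior.sum()                                 # p(s)
likelihood = rng.random((4, 2))
likelihood /= likelihood.sum(axis=1, keepdims=True)  # p(o | s), rows sum to 1

o = 1                                # the observation received
q = np.array([0.1, 0.2, 0.3, 0.4])   # an arbitrary approximate posterior

# Direct definition: F = E_q[log q(s) - log p(s, o)].
joint = prior * likelihood[:, o]
f_direct = float(np.sum(q * (np.log(q) - np.log(joint))))

# Decomposition: complexity (divergence of beliefs from the prior)
# minus accuracy (expected log-probability of the observation).
complexity = float(np.sum(q * (np.log(q) - np.log(prior))))
accuracy = float(np.sum(q * np.log(likelihood[:, o])))
f_decomposed = complexity - accuracy
```

The decomposition makes the trade-off explicit: free energy can fall either because `accuracy` rises (better prediction of the data) or because `complexity` falls (beliefs that stay closer to the prior).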

The algorithmic level
PP has (at least until recently) been associated with an algorithm roughly patterned after the predictive coding model proposed by Rao and Ballard (1999) and its extensions by Friston (2005). The basic idea in these models is that the brain minimizes prediction error using a hierarchically structured arrangement of simple prediction and prediction error units. Each layer of the hierarchy, considered in isolation, aims to predict the activity of the layer below.
Errors are passed up the hierarchy to the layer above, and predictions are passed down to the layer below. The error units at the base of the hierarchy receive external input corresponding to sensory stimulation. Prediction units send inhibitory signals to minimize the excitatory responses of their corresponding error units. The dynamics of the network are such that, when the bottom-most layer of error units is driven by a stream of external sensory stimulation, internal predictions will converge onto values that minimize the internal error-related activity generated by that stimulation. From one perspective, these "prediction" and "error" signals can be described solely in neural terms (i.e., as directionally specific inhibitory and excitatory signals, respectively). At the same time, when grounded by the external features/regularities to which specific error signals are in fact sensitive, the exact connectivity pattern involved (i.e., which specific prediction units are set up to inhibit which error units, etc.) can also allow distinct downward (inhibitory) signals to be interpreted as specific content-laden predictions about the world (see Kiefer & Hohwy, 2017).
The units that make up a predictive coding network share several features with familiar connectionist networks: communication through weighted connections, changes in connection weights over time through associative learning rules, and the use of an activation function to control continuous response levels in each unit. A simple predictive coding algorithm might look like that shown in Fig. 1 (Harpur, 1997; Rao & Ballard, 1999; Spratling, 2017).
The y_i units are the prediction units, the e_i units are the error units, and the x_i are the external sensory inputs. The dynamics of this network are such that the activity of the error units will be e = x − Wy, intuitively capturing a measure of sensory prediction error, that is, the difference between the sensory input, x, and an internally generated prediction, Wy. The weights of the connections between prediction units and error units (W) can be viewed as the network's "recipe" for combining and weighting individual prediction values to generate a prediction. In other words, they allow the network to act as a generative model of the incoming sensory signal. Precision weighting of error signals can be added to the model by introducing further lateral and self-inhibitory connections between error units (Feldman & Friston, 2010; Friston, 2005).
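A minimal sketch of these dynamics, assuming (as in standard treatments such as Rao & Ballard, 1999, and Bogacz, 2017) that the prediction units perform gradient descent on the squared error while the weights W are held fixed; the network sizes, update rate, and iteration count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

n_inputs, n_latent = 8, 3
W = rng.standard_normal((n_inputs, n_latent))  # fixed generative weights
x = rng.standard_normal(n_inputs)              # external sensory input

y = np.zeros(n_latent)  # prediction-unit activity, initialized at zero
eta = 0.02              # update rate

for _ in range(20_000):
    e = x - W @ y            # error units: input minus top-down prediction
    y = y + eta * (W.T @ e)  # prediction units descend the squared-error gradient

# At convergence the remaining error is orthogonal to the columns of W,
# so no further adjustment of y can reduce it.
residual_gradient = W.T @ (x - W @ y)
```

The ascending signal (`W.T @ e`) and descending signal (`W @ y`) mirror the excitatory error pathway and inhibitory prediction pathway described in the text; learning of W itself (not shown) would adjust the generative model on a slower timescale.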
It is worth noting that, even for this simple network, excitatory error signals tend to flow upwards (from error units to prediction units), and inhibitory prediction signals tend to flow downwards (from prediction units to error units). As we will see in the next section, this arrangement mimics the basic structure of the neocortex.
Fig. 1. A basic predictive coding network. Weighted connections, w_{i,j}, between prediction and error units are assumed to be reciprocal and of equal and opposite weights. Precision weighting and connections to/from other layers are not shown.

The same computational motif can be extended to create a larger network, as shown in Fig. 2. Here, the prediction units, y, connect to a second layer of error units, e′. Those prediction units function, for those error units, in a similar way as the external sensory input, x, functions for the lower units, e. A second layer of prediction units, y′, aims to minimize activity in the e′ error units via its own set of weighted downward connections. The y′ units function to predict, namely, to inhibit the results of, the latent activity of the prediction units (y) that drive those e′ error units. The same arrangement can be repeated again, with the prediction units in higher layers aiming to predict the activity of units in the layer below, and so forth. The weighted connections between prediction and error units across the whole network can be viewed as a single generative model, reflecting beliefs about how hidden states of the world generate sensory signals (with activity in higher-level prediction units tending to track regularities across multiple sensory channels and/or longer temporal scales).
With respect to the Bayesian and free-energy formulations, a hierarchical network like this can be used to perform probabilistic inference (Bogacz, 2017). The sensory input signal, x, can be treated as an observation, o, of an agent. The prediction value, y, can be interpreted as the agent's current estimate of the value of some latent state variable in the world, s, that generated this sensory input. Let us assume that any such agent has a prior belief about the value of s that is normally distributed, with mean μ and variance Σ_s. In terms of the network model, this prior, p(s), would be encoded by values in the layer above y: the mean, μ, would be encoded by the prediction generated from above (W′y′), and the variance, Σ_s, by the lateral connections between the associated error units above (implementing the precision weighting Π′). The weighted connections to the lower level (W) would encode a likelihood function, p(o | s), which yields, for each assumed value of s, a probability distribution over possible observations. Assume, for the sake of argument, that any such probability distribution over o is also normally distributed, with a mean set by a deterministic function that captures how the agent believes that o depends on s, v(s) = Wy. The variance of the distribution over o (Σ_o) would be encoded by the lateral connections between the error units inside the layer below (implementing the precision weighting Π). The associated network dynamics are then such that each change in sensory input (via e) will result in an updated posterior estimate of the value of s, p(s | o), encoded in the activation value of y.

Fig. 2. A hierarchical predictive coding network. The shaded area is equivalent to the network shown in Fig. 1 (with the addition of precision weighting, Π, and ascending/descending connections to the next layer). Individual units are shown aggregated together and represented by single nodes (e.g., all e_i units in Fig. 1 are represented by e). Lateral and self-inhibitory connections on error units have been added (e.g., Π, Π′, …), which implement precision weighting of prediction errors at each level. This weighting may be dynamic and learned over time based on the variance of the incoming signal. Unlabeled connections (e.g., connections between y and e′) do not have additional modulatory weighting. The associated pattern of weights (W) entails the nature of the predictions made at each level given current beliefs y (and these weightings could also be learned, in principle).
With this setup in mind, imagine that the agent wishes to estimate the value of a hidden state in the world, s (e.g., the direction of motion of some object), given some observation, o (e.g., changing patterns of retinal stimulation). For the sake of simplicity, assume that the agent only wants the maximum a posteriori point estimate of s (i.e., not full Bayesian inference yielding a posterior distribution over s). In this case, the agent's objective is to find the s that maximizes:

p(s | o)

Applying Bayes' theorem, and ignoring the denominator, since it does not depend on the choice of s, the agent seeks the s that maximizes:

p(o | s) p(s)

If we also continue with the Gaussian assumptions described above, these probabilities are:

p(o | s) = N(o; v(s), Σ_o),  p(s) = N(s; μ, Σ_s)

Plugging these in and taking their logarithms, the agent seeks to maximize F (negative free energy):

F = −(1/2) [ (o − v(s))² / Σ_o + (s − μ)² / Σ_s ] + C

Note that the term C gathers together everything that does not depend on the choice of s and so can be ignored. In order to maximize F (equivalent to minimizing free energy), the agent needs to minimize the sum inside the brackets. That sum is a measure of precision-weighted prediction error within the network. Specifically, the first term is the (squared) difference between the actual and predicted value of o, normalized by its variance, that is, weighted by its precision. The second term is the (squared) difference between the estimated and (prior) expected value of s, again precision-weighted. In terms of network values, maximizing F is equivalent to minimizing:

Π (x − Wy)² + Π′ (y − W′y′)²

This corresponds to the precision-weighted activation levels of the error units, e and e′. Thus, if the network were to minimize precision-weighted prediction error activity in these two layers, it would maximize F and thereby perform maximum a posteriori inference. Larger networks with a similar architecture can go beyond simple point estimation to implement full Bayesian inference (Friston, 2005; Friston, 2008; Friston, 2010).
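A scalar version of this inference can be simulated directly. The update rule below is the gradient of F with respect to s, written in terms of the two precision-weighted error signals; all numbers (prior, weights, observation, step size) are arbitrary illustrative choices:

```python
# Scalar instance: prior s ~ N(mu, sig_s); likelihood o ~ N(w * s, sig_o).
mu, sig_s = 3.0, 1.0   # prior mean and variance over the hidden state s
w, sig_o = 2.0, 0.5    # likelihood: v(s) = w * s, with observation variance sig_o
o = 7.0                # the observation actually received

s = mu  # start the estimate at the prior mean
for _ in range(10_000):
    e_o = (o - w * s) / sig_o     # precision-weighted sensory prediction error
    e_s = (s - mu) / sig_s        # precision-weighted prior prediction error
    s += 0.001 * (w * e_o - e_s)  # gradient ascent on F

# Closed-form MAP estimate for this linear-Gaussian case, for comparison.
s_map = (mu / sig_s + w * o / sig_o) / (1.0 / sig_s + w ** 2 / sig_o)
```

The iterative estimate converges to the precision-weighted compromise between the prior expectation and the observation, exactly the MAP value, illustrating how local error-driven dynamics can implement the inference described above.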

The neural implementation level
In standard presentations, PP is claimed to be implemented primarily in the neocortex, which contains (depending upon the criteria one adopts) between roughly 50 and 200 anatomically/functionally distinct cortical areas. These appear to be connected in a roughly hierarchical fashion (at least until one reaches higher association cortices) via excitatory (ascending) and inhibitory (descending) synaptic pathways (Douglas & Martin, 2004; Felleman & Van Essen, 1991). In the simplest schemes, these neural pathways implement the weighted connections (W, W′, …) between layers of prediction and error units. Descending, inhibitory synaptic pathways between cortical areas implement the descending, inhibitory weighted connections in the algorithm (e.g., connections from y to e). Ascending, excitatory synaptic pathways in turn implement the ascending, excitatory weighted connections (e.g., connections from e to y). Lower cortical areas, closer to the sensory or motor boundaries, correspond to lower layers of the network model (closest to the input, x). Higher cortical areas correspond to higher layers of the network (further from the input, x). The overall pattern of synaptic connectivity across the cortical hierarchy encodes the brain's generative model of sensory input.
In Fig. 2, for example, an individual cortical area might correspond to a pair of vertically aligned prediction and error nodes (e.g., y and e′). In this case, the prediction node (e.g., y) would be implemented by deep pyramidal cells (sitting in layers V and VI), which send inhibitory signals to superficial ("error") neurons in lower cortical areas in the hierarchy, and are targets of excitatory synapses from superficial cells in those lower cortical areas. The error node (e.g., e′) would be implemented by superficial pyramidal cells (sitting in layers II and III), which send excitatory signals to deep ("prediction") neurons in higher cortical areas in the hierarchy (and are targets of inhibitory synapses from deep cells in those higher cortical areas). The numerical values that feature in the algorithm are assumed to be encoded by activity levels in these cells (e.g., membrane depolarization levels, firing rates) using some sort of temporal or rate-based population-coding scheme. Individual components of these abstract prediction/error signals (y_1, …, y_n in y, or e_1, …, e_n in e) are assumed to be implemented by functionally distinct subpopulations of superficial or deep cells inside a cortical area. Each prediction and error unit pair, (y_i, e_i), might, for example, correspond to deep and superficial pyramidal cells within a single cortical column (Bastos et al., 2012).
Precision weights could be implemented in multiple ways. One possibility is through inhibitory synaptic connections between cells that encode the error signal (Bogacz, 2017). However, the distribution of precision weights sometimes needs to change rapidly, such as during shifts in the agent's attention. Friston suggests two fast-acting physical mechanisms that could be responsible for modulating activity in error units (Friston, 2010). One is dynamic adjustment of neuromodulator (e.g., dopamine, norepinephrine, and acetylcholine) release in the vicinity of the relevant error cells, which can lead to effects such as changes in intrinsic firing rates, changes in firing-rate thresholds, suppression of adaptation effects, or altered efficacy of existing synaptic connections. The other is fast gamma-band synchronization, which can selectively boost the gain of certain error cells without affecting the responsiveness of the deep ("prediction") cells (Bastos et al., 2012). These two mechanisms might work in tandem or combine with any number of other physical processes to jointly implement the effect of precision weighting over error units.
Bastos and colleagues provide a detailed account of how predictive coding might be implemented in the neocortex, the so-called canonical microcircuit (Bastos et al., 2012). This proposal implements a more sophisticated variant of predictive coding that affords predictions at hierarchically nested temporal scales (also see Shipp, 2016). Kanai and colleagues have also described how noncortical circuits (in the thalamus) could play a role in implementing predictive coding (Kanai, Komura, Shipp, & Friston, 2015). More speculatively, Clark has explored the possibility that processes external to the brain (e.g., in the non-neural body or in the environment) could play a role in implementing parts of this algorithm (in particular, by storing some precision weights in external environmental or bodily features; Clark, 2016).

Decision-making and active inference
Thus far, we have used predictive coding as a way of illustrating several abstract features of the broader PP framework. However, predictive coding is just one algorithm and, as described, it is limited to perceptual inference. In this section, we move to models of decision-making within PP. To aid intuitive understanding, we also provide a concrete example of how one such decision algorithm can be used to perform simulations, and how it can be applied in empirical research. Specifically, we will consider a commonly used active inference algorithm for solving POMDPs (Da Costa, Parr, Sajid, Veselic, Neacsu, & Friston, 2020; Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2017; Friston, Rosch, Parr, Price, & Bowman, 2018; Friston et al., 2016; Parr & Friston, 2018; Smith et al., 2022b).
For clarity, this POMDP formulation should be distinguished from an older use of the term "active inference" in previous literature, which instead referred to a PP model of motor control (Adams et al., 2013; Brown, Friston, & Bestmann, 2011; Friston, Daunizeau, Kilner, & Kiebel, 2010). This motor control formulation was mentioned above and offered a theory of how prediction signals could act as motor commands if they were transiently afforded precise influences on the set points of spinal reflex arcs. Specifically, it demonstrated how, if proprioceptive prediction signals were highly weighted (i.e., so that they were not updated by contradictory sensory information about body position), they could lead the body to move to the position associated with the new set point (i.e., corresponding to the descending prediction).
In some discussions (e.g., see Clark, 2015), this kind of motor control theory has also been extended to hierarchical control settings in which higher-level, compact motor plans can be progressively unpacked through descending levels, ultimately resulting in low-level predictions that can control many motor processes in parallel over extended timescales. For example, the plan to go to the store could set a lower-level plan to walk toward your car and start the engine, which could set a yet lower-level plan to take a sequence of steps, and so forth until reaching dynamic adjustments of set points in spinal reflex arcs. While important, we do not focus on this theory of motor control here. The crucial point is that, even in its hierarchical guise, this was not a theory of decision-making.³ Another decision-making process would still be necessary to decide which high-level action plan to generate in the first place. It is this decision process that we focus on here.
The POMDP formulation of active inference can be cast in terms of the same belief-like elements as other PP models, such as predictions, prediction errors, and precisions. However, these elements now come in many flavors, because decision-making requires a more complex computational architecture. In general, POMDP architectures assume "partial observability," which means that an agent must infer the states of the world from imprecise observations that admit multiple interpretations. They also incorporate a Markov decision process, which entails that those states evolve over time in a manner that depends on an agent's choices. Because this POMDP structure is (assumed to be) represented in the brain as a generative model, this allows an agent to possess prior beliefs (and associated predictions/precisions) associated with expected observations, expected initial states, and expected transitions between states, as well as distinct prior beliefs reflecting confidence in action selection and the influence of habits. Decision processes also require goals, which entail that some observations are preferred over others. Decision-making then aims to generate those preferred observations. In the active inference formalism, preferences for observations can also have a precision weighting. This is because they are mathematically represented using a special type of probability distribution, a "prior preference distribution," where a higher "probability" indicates that something is more preferred. While it is treated mathematically as a probability distribution, its psychological-level description maps best onto constructs such as goals or desired outcomes, depending on the context (Smith et al., 2022c). As we will see, different preference precision weightings can lead an agent to be either more risk-seeking or more information-seeking.
In the following section, we lay out the mathematical formalism underlying active inference on POMDPs. This will illustrate how, based on the various beliefs mentioned above, agents can minimize two distinct types of prediction errors in decision-making. One is a state prediction error, which is minimized with respect to current observations to perform perceptual inference. This bears some resemblance to predictive coding (Bogacz, 2017; Walsh, McGovern, Clark, & O'Connell, 2020), but the states are here assumed to be categorical (e.g., being in a house or a car) as opposed to continuous (e.g., how fast an object is moving). Unlike predictive coding, these state prediction errors are also action-dependent, in that predictions about future states will differ depending on the choices one is considering. The other type of prediction error could be described as an expected preference prediction error, which is minimized in decision-making to generate preferred outcomes. Here, agents choose actions that are expected to minimize this error, which corresponds to the deviation between expected and preferred observations. We will also see how state prediction errors (which drive perceptual belief updating) are associated with variational free energy, while preference prediction errors (which drive action selection) are associated with expected free energy.
At the outset, however, we wish to emphasize that predictive coding and active inference should be seen as complementary theories. They are mutually compatible and largely apply to distinct domains. Predictive coding remains a promising model of perceptual inference in the continuous state domain (e.g., inferring brightness, loudness, motion direction, etc.), while active inference aims to model decision-making under uncertainty and is most often applied in the discrete (categorical) state domain (e.g., recognizing a cat vs. a dog). In principle, these could also be combined in hierarchical models, where the passive perceptual inference process described in predictive coding would remain a special case in contexts where an agent has limited control over states (or where the environment does not offer epistemic or pragmatic affordances that would motivate action beyond the expected metabolic costs of movement). There are also so-called mixed models in the active inference framework that link discrete and continuous models hierarchically and bear some resemblance to this idea, while also incorporating motor control (e.g., see Friston, Parr, & de Vries, 2017). Thus, the framework described below should be seen as offering tools for modeling circumstances outside the domain of predictive coding, and not as a competing theory.

Defining a POMDP
To create a generative model (POMDP) on which an agent can perform active inference, we first define a set of states of the world (s), that is, internal or external causes of sensory signals, that could be present at any point in time (τ). We include a probability distribution p(s_τ=1) encoding beliefs about the probability of each state/cause being present at some initial time point. We then define beliefs about the sequences of actions the agent could consider, termed "policies," denoted by π. Each policy predicts a specific sequence of state transitions (i.e., those expected to occur if the associated actions were chosen). This means we need to specify how the agent believes states will change from one time to the next under each possible policy, p(s_τ+1|s_τ, π). At each time point, the agent is modeled as receiving observations (o_τ; i.e., sensory input), and must use these observations to infer posterior beliefs over states. This is based on a mapping that specifies how states generate observations, p(o_τ|s_τ), often called the "likelihood" mapping. Because each policy entails a different sequence of state transitions, and each state generates specific observations, the agent can use its model to predict the observations expected under each policy, p(o_τ|π), and observed outcomes can then provide different amounts of evidence for some policies over others. Importantly, since we can only approximate optimal belief updating by attempting to minimize variational free energy, we also need to define an approximate posterior belief distribution under each policy (the agent's best guess), denoted q(s_τ|π); this will be updated with each new observation, corresponding to perception.
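These model elements can be written down concretely. The following sketch defines a toy two-state, two-observation POMDP using the array conventions adopted later in the text (A for the likelihood, B for transitions, C for preferences, D for the initial-state prior); the specific numerical values are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Likelihood mapping p(o_tau | s_tau): rows = observations, columns = states.
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# Transition beliefs p(s_{tau+1} | s_tau, pi), one matrix per (one-step) policy.
B = {
    "stay": np.eye(2),
    "switch": np.array([[0.0, 1.0],
                        [1.0, 0.0]]),
}

# Log prior preferences over observations (higher = more preferred).
C = np.array([0.0, -3.0])

# Prior over initial states p(s_{tau=1}).
D = np.array([0.5, 0.5])

# Each column of A and B must be a proper probability distribution.
assert np.allclose(A.sum(axis=0), 1.0)
assert all(np.allclose(m.sum(axis=0), 1.0) for m in B.values())
assert np.isclose(D.sum(), 1.0)
```

Defining policies as keys into a dictionary of transition matrices keeps the correspondence between "policies predict state transitions" and the formalism explicit.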
With this model in place, we assume the agent computes the variational free energy (F) for each policy as follows:

F(π) = E_{q(s_τ|π)}[ ln q(s_τ|π) − ln p(o_τ, s_τ|π) ]
     = D_{KL}[ q(s_τ|π) || p(s_τ|π) ] − E_{q(s_τ|π)}[ ln p(o_τ|s_τ) ]

The first equation indicates that F is the difference between an approximate posterior belief, q(s_τ|π), and a generative model, p(o_τ, s_τ|π). The second equation represents an algebraic rearrangement of the first, often shown for greater ease of interpretation (for explicit steps to arrive at this rearrangement, see Smith et al., 2022b). The first term on the right-hand side of this second form of the equation is often called the "complexity" term. This is the KL divergence between prior and approximate posterior beliefs (i.e., a measure of how much beliefs change after a new observation). This term entails that a larger change in beliefs will generate a higher F value, which is disfavored. The second term on the right is predictive accuracy (i.e., the probability of observations given one's beliefs about states). Less predictive accuracy (i.e., greater prediction error) will also lead to higher F values. Similar to its use in predictive coding, this equation for variational free energy in POMDPs therefore says that minimizing F maximizes belief accuracy while also changing beliefs as little as possible. This is evaluated separately with respect to possible courses of action (which can each entail different prior beliefs). Here, it is important to clarify that, despite being calculated under each policy, F can only be evaluated with respect to past and present observations, and thus primarily serves perception. After the selection of a chosen action (via further computations described below), the resulting observation can be used to evaluate how states of the world have changed. However, if these observations are inconsistent with what was expected under a given policy, this can help to adjust beliefs about the optimal policy going forward. For example, if one chose to look to the left expecting to see a square, and instead observed a circle, this would decrease confidence in policy selection going forward (e.g., leading subsequent choices to be less deterministic).
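The complexity-minus-accuracy form of F can be computed directly. The following is a minimal sketch for a single time point and policy, assuming a discrete state space, a likelihood matrix A, and a one-hot observation index (the function name and toy numbers are my own):

```python
import numpy as np

def variational_free_energy(q_s, prior_s, A, o_idx, eps=1e-16):
    """F = KL[q(s) || p(s)] - E_q[ln p(o|s)]: complexity minus accuracy."""
    complexity = np.sum(q_s * (np.log(q_s + eps) - np.log(prior_s + eps)))
    accuracy = np.sum(q_s * np.log(A[o_idx, :] + eps))
    return complexity - accuracy

A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
prior = np.array([0.5, 0.5])

# The exact Bayesian posterior after observing o = 0 yields a lower F
# than keeping the prior unchanged.
posterior = prior * A[0, :] / (prior * A[0, :]).sum()
F_post = variational_free_energy(posterior, prior, A, 0)
F_prior = variational_free_energy(prior, prior, A, 0)
assert F_post < F_prior
```

At the exact posterior, F equals the negative log evidence −ln p(o), which is the sense in which minimizing F approximates Bayesian inference.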
In the neural process theory associated with perception in active inference, this is accomplished by minimizing a particular type of state prediction error,⁴ where greater error is associated with higher F values (Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2017; Parr, Markovic, Kiebel, & Friston, 2019). These errors can be written as:

ε_{π,τ} = ½( ln(B_{π,τ−1} s_{π,τ−1}) + ln(B†_{π,τ} s_{π,τ+1}) ) + ln(Aᵀ o_τ) − ln s_{π,τ}
v_{π,τ} ← v_{π,τ} + ε_{π,τ}
s_{π,τ} ← σ(v_{π,τ})

In this notation, the left-pointing arrows indicate value updating, ε_{π,τ} is the state prediction error under each policy and time point, B_{π,τ} is a matrix encoding p(s_τ+1|s_τ, π) (with B† its normalized transpose, carrying messages backward from beliefs about the subsequent time point), A is a matrix encoding p(o_τ|s_τ), v_{π,τ} is an auxiliary variable used to represent a simulated neural membrane voltage, and s_{π,τ} = q(s_τ|π). The σ symbol denotes a "softmax" operation that converts the resulting quantity within the parentheses back into a proper probability distribution. This allows s_{π,τ} to be interpreted as a normalized firing rate. It can be shown that, by iterating this set of equations several times, ε_{π,τ} will approach a minimum (corresponding to minimization of F) and q(s_τ|π) will converge to a stable, approximately optimal value. One can read ε_{π,τ} as calculating the difference between the predictions of the generative model (the terms preceding the final term) and the current posterior estimate (i.e., −ln s_{π,τ}). In the neural process theory, the rate of change in posterior beliefs is typically taken as a prediction about the magnitude of measured neural responses, such as event-related potentials in electroencephalography (EEG) research (for examples of studies showing how these predictions are consistent with experimental results, see Parr, Markovic, Kiebel, & Friston, 2019; Parr & Friston, 2018; Parr & Friston, 2017; Smith et al., 2021c; Whyte, Hohwy, & Smith, 2022; Whyte & Smith, 2021). A detailed depiction and explanation of how a neural network inspired by a cortical column-like structure could implement these processes can be found elsewhere (Parr et al., 2019; Parr & Friston, 2018; Smith et al., 2022b).
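This iterative update scheme can be sketched for the simplest possible case: a single time point with no past or future messages (i.e., dropping the B terms), so that the prediction error reduces to ln(prior) + ln(Aᵀo) − ln(s). The function name, learning rate, and iteration count below are illustrative choices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def infer_states(A, prior_s, o_idx, n_iter=64, lr=0.25, tiny=1e-16):
    """Gradient-style minimization of a single-time-point state prediction error."""
    n_obs, n_states = A.shape
    o = np.zeros(n_obs)
    o[o_idx] = 1.0
    s = np.full(n_states, 1.0 / n_states)    # start from a flat posterior
    v = np.log(s)                             # simulated membrane voltage
    for _ in range(n_iter):
        err = (np.log(prior_s + tiny) + np.log(A + tiny).T @ o) - np.log(s + tiny)
        v = v + lr * err                      # voltage integrates the error
        s = softmax(v)                        # normalized "firing rate"
    return s

A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
prior = np.array([0.5, 0.5])
s = infer_states(A, prior, o_idx=0)

# In this degenerate case the iteration converges to the exact posterior.
exact = prior * A[0, :] / (prior * A[0, :]).sum()
assert np.allclose(s, exact, atol=1e-3)
```

Because the softmax is invariant to uniform shifts of v, only the differences between error components matter, and those differences decay geometrically toward the exact posterior log-odds.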
When we move from perception to decision-making, we require more than just beliefs about past and present states. We also need predictions about future states and observations. To evaluate policies, we then need to compute an expected free energy, G(π), for the observations predicted under each of the policies one might choose:

G(π) = E_{q(o_τ,s_τ|π)}[ ln q(s_τ|π) − ln p(o_τ, s_τ|π) ]
     = −E_{q(o_τ|π)}[ D_{KL}[ q(s_τ|o_τ, π) || q(s_τ|π) ] ] − E_{q(o_τ|π)}[ ln p(o_τ|C) ]

The first equation here is identical to that for F shown above, except that observations have been included in the expectation, E_{q(o_τ,s_τ|π)}. Some algebraic rearrangement, and introduction of a vector, C, results in the second equation (for explicit steps to arrive at this rearrangement, see Smith et al., 2022b). The first term on the right-hand side of this second equation encodes the anticipated change in beliefs over states after getting a new observation. The larger this "epistemic value" quantity, the more information the agent expects to gain (i.e., the greater the reduction in uncertainty will be). Because this quantity is subtracted, greater epistemic value leads to smaller expected free energies, which are favored. The second term on the right is the (log) probability of observations given the prior preference distribution encoded in C. Finding the policy expected to maximize this value, which will minimize G, will therefore be favored because it is likely to produce preferred observations. This can also be loosely interpreted as minimizing a type of prediction error (i.e., choosing policies to minimize the divergence between true observations and those predicted by C). Combining both terms in the equation, policies expected to maximize both information gain and preferred outcomes will have the lowest expected free energy.
As mentioned above, the precision of the preference distribution C can have important influences on behavior. This is because, if this distribution is more precise, this term will be weighted more heavily than the epistemic value term when calculating the expected free energy. This has the effect that policy selection is less driven by uncertainty minimization and will tend to seek reward directly despite incomplete information. If preference precision is low, the agent will instead be driven to maximize information gain before deciding how best to maximize preferred observations. This, therefore, governs how an agent handles the explore-exploit dilemma: the difficult judgment of how much one should explore and understand the environment before exploiting current beliefs to (try to) maximize reward.
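The two terms of G for a single predicted time step can be sketched as follows. Here the epistemic value is computed as the expected KL divergence between the posterior after each possible observation and the current belief (i.e., the mutual information between observations and states), and the pragmatic value as the expected log preference; the function name and example matrices are my own illustrative choices:

```python
import numpy as np

def expected_free_energy(A, q_s, log_C, eps=1e-16):
    """G = -(epistemic value) - (expected log preference), one time step ahead."""
    q_o = A @ q_s                       # predicted observation distribution
    epistemic = 0.0
    for o in range(A.shape[0]):         # expected KL[q(s|o) || q(s)]
        post = A[o, :] * q_s
        post = post / (post.sum() + eps)
        epistemic += q_o[o] * np.sum(
            post * (np.log(post + eps) - np.log(q_s + eps)))
    pragmatic = np.sum(q_o * log_C)
    return -epistemic - pragmatic

A_informative = np.array([[0.9, 0.1],
                          [0.1, 0.9]])
A_ambiguous = np.full((2, 2), 0.5)      # observations carry no information
q_s = np.array([0.5, 0.5])
flat_C = np.zeros(2)                    # flat preferences (precision ~ 0)

# With flat preferences, only epistemic value differentiates policies:
# the informative mapping yields a lower (better) G.
assert expected_free_energy(A_informative, q_s, flat_C) < \
       expected_free_energy(A_ambiguous, q_s, flat_C)
```

Scaling log_C up or down plays the role of preference precision: with a sharply peaked log_C, the pragmatic term dominates and the epistemic term is effectively down-weighted, reproducing the risk-seeking versus information-seeking trade-off described above.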
Once the expected free energy is known, the posterior probability distribution over policies can then be calculated as:

q(π) = σ( −G(π) )

This simply says that the most likely policies are those that minimize expected free energy. In some of the active inference literature, there is also an extended version of this equation:

q(π) = σ( ln p(π) − γ·G(π) )

This further incorporates p(π), which is a vector encoding a prior belief over policies that can be used to model habits. Here, learning habits simply means that the agent is more likely to select a policy if that policy was selected more often in the past. The γ term is a scalar (single number) that modulates the precision of the expected free energy, encoding the agent's prior confidence in its ability to select optimal actions. Low values for γ make action selection more random; if p(π) is precise, a low γ value can also make action more driven by habits. We note here that the value for γ can also be updated with new observations (by further incorporating the variational free energy of each policy after those observations), but the associated equations are beyond the scope of this paper (the interested reader is referred to Smith et al., 2022b). Heuristically, γ values will increase if new observations provide added support for the currently favored policy (e.g., suggesting G(π) is successfully guiding choice), and decrease if observations instead reduce support for that policy. When included, the agent can start with an initial prior (expected value) for γ, which is typically taken from a gamma distribution with a shape parameter equal to 1 and a rate parameter equal to β (parameterizing the agent's initial uncertainty). All model dependencies described here are depicted in graphical form in the left panel of Fig. 3. The right panel of this figure also shows an example of simulated firing rates and event-related potentials in a simple model with two states where the agent receives a single observation that indicates, with probability = .9, which state it is in.

Fig. 3. Left: Graphical depiction of the probability distributions and dependencies between variables within the active inference POMDP formalism described in the text. In the notation used here, p(•) denotes probabilities in the generative model used to predict observations (o), while q(•) denotes the approximate posterior probabilities the agent must infer when observations are received. States (s) at each time (τ) generate observations, and transitions between states depend on policies (π). The probability of choosing a policy depends on its expected free energy (G), which in turn depends on the probability of generating preferred observations (C). Note that, in much of the active inference literature, p(π) is encoded in a vector E, p(s₁) is encoded in a vector D, p(s_τ+1|s_τ, π) is encoded in a set of matrices B_{π,τ}, and p(o_τ|s_τ) is encoded in a matrix A. In more complex generative models, these matrices can also become higher-dimensional tensors. Right: An example of simulated firing rates (darker = higher firing rate) and event-related potentials (ERPs) based on the neural process theory in active inference. This simulation used a simple generative model with two states and one policy, where the agent receives a single observation that indicates, with probability = .9, which state it is in (i.e., favoring State 2). As described in the text, the firing rates map to the changing values in q(s_τ|π), and the ERPs map to the rate of change in these values.
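The extended policy posterior, q(π) = σ(ln E − γ·G), is a one-line softmax. The sketch below (function name and example values are my own) illustrates how γ controls the randomness of policy selection:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def policy_posterior(G, E=None, gamma=1.0):
    """q(pi) = softmax(ln E - gamma * G); flat habits if E is None."""
    G = np.asarray(G, dtype=float)
    ln_E = np.zeros_like(G) if E is None else np.log(np.asarray(E, dtype=float))
    return softmax(ln_E - gamma * G)

G_vals = [1.0, 2.0]                           # hypothetical expected free energies
q_low = policy_posterior(G_vals, gamma=0.1)   # low gamma: near-random selection
q_high = policy_posterior(G_vals, gamma=10.0) # high gamma: near-deterministic
assert q_low[0] < q_high[0]
```

With γ = 0 the expected free energies are ignored entirely and selection falls back on the habit prior E, matching the description above of habit-driven action under low confidence.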
Actions are chosen at each time point (t) based on beliefs over policies as follows:

p(a_t) = σ( α · ln Σ_π q(π)[π_t = a_t] )

where [π_t = a_t] equals 1 when policy π prescribes action a_t at the current time and 0 otherwise. The parameter α is called an inverse temperature or "action precision" parameter. Lower values introduce randomness into choice, by increasing the probability of choosing actions that are not entailed by the favored policy. Although we will not describe this in detail, we note that time is denoted here as t because it refers to the current time point. This is distinct from the time variable τ above, which refers to the times about which agents can have beliefs. For instance, a new observation at time point t = 2 (e.g., after turning on a light) could update an agent's beliefs about the state it was in at the previous time point τ = 1 (e.g., what room it was in before turning on the light).
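Marginalizing policy beliefs onto actions and applying the action precision can be sketched as follows (the function name and the mapping from policies to actions are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def action_probs(q_pi, policy_actions, n_actions, alpha=4.0, eps=1e-16):
    """p(a_t) from q(pi): sum policy probabilities per prescribed action,
    then sharpen or flatten with the action precision alpha."""
    q_a = np.zeros(n_actions)
    for q, a in zip(q_pi, policy_actions):
        q_a[a] += q
    return softmax(alpha * np.log(q_a + eps))

q_pi = np.array([0.7, 0.3])                    # posterior over two policies
p_sharp = action_probs(q_pi, [0, 1], n_actions=2, alpha=16.0)
p_random = action_probs(q_pi, [0, 1], n_actions=2, alpha=0.0)
assert p_sharp[0] > 0.99        # high alpha: nearly deterministic choice
assert np.allclose(p_random, 0.5)  # alpha = 0: uniformly random choice
```

This makes the role of α concrete: it rescales the log marginal action probabilities before the softmax, so α = 0 yields uniform randomness and large α approaches deterministic selection of the favored action.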
Finally, learning in active inference is accomplished by incorporating, and updating, prior beliefs over model parameters. For example, if an agent needed to learn the probability of observing a preferred outcome when in some state, this would involve updating the parameters within p(o_τ|s_τ) after making each new observation when in that state. Mathematically, analogous processes can also be used to learn parameters associated with state transition probabilities and prior probabilities over policies (habits), among other model elements. This process can also be controlled by rates of learning and forgetting that modulate how the relevant model parameters change over time with new observations. Formally, the priors over parameters that are learned take the form of Dirichlet distributions, and learning proceeds by updating the so-called concentration parameters (or "counts") within those distributions. This amounts to a simple Hebbian learning process in which coincidences (e.g., between observations and states, or between states at two consecutive time points) are treated as evidence of probabilistic relationships between variables. Learning p(o_τ|s_τ), for example, is implemented as follows:

p(A) = Dir(a)
a ← ω·a + η·(o_τ ⊗ s_τ)

Here, the concentration parameters (a₁, a₂, …) in the Dirichlet distribution (a) represent the probability of each of two observations (rows) given each of two states (columns). The second equation simply says that if, for example, the agent was in the left state and received the first (top row) observation, the value for a₁ (where they intersect) should increase (i.e., the expected probability of that observation under that state should increase). Here, ω values less than 1 allow the agent to gradually forget past experience, and η values control how quickly beliefs change after each new observation. Some mathematical details are beyond the scope of this presentation (for details, see Smith et al., 2022b), but there is one crucial point worth highlighting that pertains to how agents seek out information to improve their generative models of the world. Namely, when learning is included (i.e., when there is uncertainty in model parameters), the expected free energy requires an additional term. This term differs depending on what parameter is being learned; for p(o_τ|s_τ), for example, it would be the final term on the right in:

G(π) = −E_{q(o_τ|π)}[ D_{KL}[ q(s_τ|o_τ, π) || q(s_τ|π) ] ] − E_{q(o_τ|π)}[ ln p(o_τ|C) ] − E_{q(o_τ,s_τ|π)}[ D_{KL}[ q(A|o_τ, s_τ) || q(A) ] ]

Here, A = p(o_τ|s_τ), so the added term simply scores the expected divergence between beliefs about p(o_τ|s_τ) before and after the new observation expected under a given policy. Since it is made negative, this says that policies expected to change beliefs about p(o_τ|s_τ) the most will have the lowest expected free energy and will be preferred. This, therefore, drives the agent to seek out observations expected to improve its model of the world.

Fig. 4 (caption, partial): … (3) a highly precise preference for one observation over others (right: "High Preference Precision" simulation). As shown in the top left, darker colors indicate higher probabilities and cyan dots indicate the true actions or observations. In this simulation, an agent wakes up in the living room after watching TV and wants to find its way to the bedroom to fall asleep for the night (i.e., in the simulations where it has a non-flat preference distribution). However, the room is dark and so it is not clear which of the two doors leads to the bedroom. The agent can, therefore, either choose to stay in the living room, guess at one of the doors (which could delay sleep if it is wrong), or turn on a light first to see which one is the bedroom door. As can be seen, with a flat preference distribution, the agent is information-seeking and turns on the light, but then has no motivation to make one choice over others and picks randomly. With moderate preference precision, the agent turns on the light and then confidently picks the bedroom door. With high preference precision, the agent is risk-seeking and guesses at a door without turning on the light (i.e., a 50-50 chance of getting it right). This illustrates how generative models with different precision values would result in different patterns of choice behavior (and could capture such behavior differences in real participants in empirical studies). These simulations are taken from the generative model described in Smith et al. (2022c).

This concludes our review of the active inference formalism. As a brief example of the inferential and action selection dynamics that emerge under this formalism, Fig. 4 shows a set of simulations in a simple explore-exploit context. In this example, the agent wakes up in its living room after watching some TV and wants to find its bedroom to fall back asleep for the night. However, because it is dark, the agent cannot see which of the two available doors will lead to the bedroom. The agent has three options: (1) it can stay in the dark room; (2) it can immediately guess at a door in the dark, which might get it to the bed more quickly (if it is lucky), but it might also lead the wrong way; or (3) it can first turn on a light and then choose a door after it sees which one leads to the bedroom. In the left "Epistemic Value Only" simulation in Fig. 4, the agent's preferences are set to have no desire for sleep (i.e., a flat preference distribution; bottom left). In this case, the agent first chooses to turn on the light to maximize information gain (i.e., due to the epistemic value term in the expected free energy). Based on the subsequent observation, it then updates its beliefs over states via minimization of variational free energy (i.e., state prediction error; here becoming confident that the bedroom door is on the left). At that point, without preferences incorporated into the expected free energy, it has no subsequent drive toward one action over another (equal probability over actions at the second time point). In the middle simulation, the agent has a moderate preference to go to sleep (see preference distribution in the bottom-middle panel). In this case, the agent first turns on the light to see which is the correct door, and then goes into the bedroom to fall asleep. In the right simulation, the agent has a highly precise preference distribution (which effectively down-weights the epistemic value in the expected free energy), leading it to take a risk and choose one of the doors at random. In this simulation, the agent got lucky and picked the correct door, but this would only happen 50% of the time with repeated simulations.
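The effect of preference precision in this scenario can be caricatured in a one-step expected free energy calculation. This is my own drastic simplification, not the model from Smith et al. (2022c): policy "look" yields one bit of information but an unpreferred observation (light on, still awake), while policy "guess" yields the preferred observation (in bed) with probability 0.5 and no information; c stands in for preference precision.

```python
import numpy as np

def efe(policy, c):
    """Toy one-step expected free energy for the dark-room scenario.
    c scales the (negative) log preference assigned to non-bed observations."""
    if policy == "look":
        # One bit of epistemic value (ln 2 nats), but the immediate
        # observation ("light on") is unpreferred, costing c.
        return -np.log(2.0) + c
    else:  # "guess"
        # No information gain; preferred outcome with probability 0.5,
        # unpreferred (wrong room) with probability 0.5.
        return 0.5 * c

# Flat preferences (c = 0): the agent looks (pure information seeking).
assert efe("look", 0.0) < efe("guess", 0.0)
# Highly precise preferences (large c): the agent gambles on a door.
assert efe("guess", 3.0) < efe("look", 3.0)
```

The crossover occurs at c = 2·ln 2, illustrating how a single precision parameter can flip an agent from information-seeking to risk-seeking behavior, as in the Fig. 4 simulations.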
These simulations demonstrate how an active inference agent will behave in simple situations under different model parameter values (e.g., with more or less precise preference distributions). This opens up the possibility that, in scientific studies, we can find the models, and the parameter values within those models, that best explain actual human cognition and behavior. In the next section, we will briefly consider some specific examples of scientific contexts in which active inference might further our understanding of neurocognitive processes.
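Before turning to applications, the Dirichlet count update described in the learning discussion above can also be sketched in a few lines (function names are my own; η and ω play the roles defined in the text):

```python
import numpy as np

def update_counts(a, o_idx, q_s, eta=1.0, omega=1.0):
    """a <- omega*a + eta*(o outer s): Hebbian-style concentration update,
    where o is a one-hot observation and q_s the posterior over states."""
    o = np.zeros(a.shape[0])
    o[o_idx] = 1.0
    return omega * a + eta * np.outer(o, q_s)

def expected_likelihood(a):
    """Posterior mean of Dir(a): normalize each column (state) to sum to 1."""
    return a / a.sum(axis=0, keepdims=True)

a = np.ones((2, 2))                            # flat initial counts
a = update_counts(a, o_idx=0, q_s=np.array([1.0, 0.0]))
A_hat = expected_likelihood(a)

# Seeing observation 0 in state 0 raises p(o=0 | s=0) above its prior of 0.5.
assert np.isclose(A_hat[0, 0], 2.0 / 3.0)
```

Setting ω below 1 decays old counts (forgetting), while η scales how strongly each new co-occurrence moves the expected likelihood, matching the learning and forgetting rates described above.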

Applications
The formalism described above allows for at least two major scientific applications: simulation experiments and statistical inference on empirical data.
In simulation experiments, one first constructs generative models of behavior based on a priori theoretical hypotheses and then demonstrates how particular computational processes can explain previous scientific findings. Such simulations provide a precise, concrete characterization of proposed theories and sometimes help demonstrate the consistency or inconsistency of various findings with different theories. One might also use the neural process theory associated with active inference to test whether previous neuroscientific observations can be explained by different models.
Regarding statistical inference on empirical data, the active inference formalism has been used to create generative models of behavioral tasks, such as explore-exploit tasks (Smith et al., 2022a; Smith et al., 2020c; Taylor et al., 2023), approach-avoidance conflict tasks (Smith et al., 2021b; Smith et al., 2021a; Smith et al., 2023), and interoception tasks (Smith et al., 2021c; Smith et al., 2020b; Smith et al., 2020a; Lavalley et al., 2023). In such studies, several possible generative models are considered, which represent distinct hypotheses about the computational processes engaged while a participant completes the task. Specific algorithms can then be used to fit the parameters of each model to the behavioral data (i.e., find the values of parameters in a model that maximize the probability of each participant's actual choices). After this fitting process, the probability of participant data under each model can be compared to identify the best model. The parameter values for each person in the winning model can then be used to test for individual and group differences with respect to a specific scientific question.
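The fitting step just described can be sketched in miniature. The example below uses a simple grid search (a stand-in for the more sophisticated fitting algorithms used in the cited studies) to find the action precision α that maximizes the log-likelihood of a participant's choices; the option "values" (e.g., negative expected free energies) and the choices are made-up illustrative data:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def choice_log_likelihood(alpha, values, choices, eps=1e-16):
    """Sum of log p(choice) across trials under a softmax(alpha * values) model."""
    return sum(np.log(softmax(alpha * np.asarray(v))[c] + eps)
               for v, c in zip(values, choices))

values = [[1.0, 0.0], [0.2, 0.8], [0.9, 0.1]]  # per-trial option values
choices = [0, 0, 0]                             # trial 2 goes against the values
grid = np.linspace(0.1, 10.0, 100)
lls = [choice_log_likelihood(a, values, choices) for a in grid]
best_alpha = grid[int(np.argmax(lls))]

# Because one choice is inconsistent with the values, the best-fitting
# precision is finite (an interior optimum, not a boundary artifact).
assert 0.1 < best_alpha < 10.0
```

In real studies this per-parameter likelihood maximization is embedded in hierarchical or Bayesian model-fitting schemes, and the fitted parameters (rather than raw behavior) are then compared across individuals or groups.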
For example, in the explore-exploit task studies mentioned above (Smith et al., 2022a; Smith et al., 2020c), individuals with substance use disorders were found to have lower action precision and slower learning rates than healthy individuals in response to negative outcomes (recently replicated in Taylor et al., 2023). Multiple model parameters also showed specific relationships with changes in symptom severity over time. In the approach-avoidance conflict studies (Smith et al., 2021b; Smith et al., 2021a; Smith et al., 2023), individuals with depression, anxiety, and substance use disorders showed greater decision uncertainty (i.e., lower initial γ values) and less emotion conflict (i.e., more precise preferences for reward despite expected unpleasant stimuli in C) than healthy individuals, and this group difference was stable over the course of 1 year. In the interoception studies (Smith et al., 2021c; Smith et al., 2020b; Smith et al., 2020a), healthy individuals were found to have greater sensory signal precision estimates than several psychiatric patient groups in cardiac perception (recently replicated in Lavalley et al., 2023), and both sensory precision and learning rate parameters in gastrointestinal perception were correlated with EEG and other peripheral physiological responses.
Such results point to the scientific and potential clinical value of active inference models. Yet, only a small number of empirical studies have taken this approach to date. Many more studies will be needed to establish the unique benefits offered by this class of models. Active inference models will also need to be more thoroughly compared to other computational models of empirical behavior. In most cases, we do not yet know whether active inference models will provide a better explanation of cognitive and behavioral data than other commonly used models, such as reinforcement learning or drift-diffusion models. There are also important unanswered questions about how generative models develop early in life. While a couple of simulation studies have begun to consider Bayesian model expansion and reduction as possible descriptions of generative model development (Smith et al., 2020d; Friston et al., 2017), these have important limitations (Rutar, Wolff, Rooij, & Kwisthout, 2022), and empirically realistic models that would afford experimental testing are in the early stages of development (but see Rutar et al., 2023). This will be an important direction going forward, as PP in general, and POMDPs in particular, offer a rich set of resources with which developmental processes might be captured.
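Comparisons between model classes of the kind called for above are typically run on the same choice data with a penalized fit criterion. The sketch below uses the Bayesian information criterion as one common choice (the criterion, log-likelihoods, and parameter counts are illustrative assumptions, not results from the studies discussed).

```python
import math

def bic(log_likelihood, n_params, n_trials):
    """Bayesian information criterion: penalizes fit by model complexity.
    Lower values indicate the better-supported model."""
    return n_params * math.log(n_trials) - 2 * log_likelihood

# Hypothetical fits of two models to the same 200-trial dataset.
bic_active_inference = bic(log_likelihood=-110.0, n_params=3, n_trials=200)
bic_reinforcement = bic(log_likelihood=-118.0, n_params=2, n_trials=200)
```

In this made-up example the active inference model's better fit outweighs its extra parameter; with real data, either class of model could win such a comparison.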

Conclusion
In this paper, we have reviewed two major threads in the PP framework, focused on perception and decision-making, respectively. We have described how the state of this field of research has evolved over time and highlighted specific areas where future work is needed to make further advances. By providing both mathematical details and concrete examples, we hope to have offered the reader both a foundation from which to understand more technical applications and intuitions about the dynamics of these models. We also hope to have clarified the relationship between PP as an umbrella conceptual framework and more specific (potentially competing) testable theories that fall within this broader framework. We look forward to further developments that will allow for more precise testing of hypotheses derived from specific PP models at the computational, algorithmic, and implementation levels of description.

2 The logarithm is used for mathematical convenience, as it converts problems of multiplication into less cumbersome problems of addition. For example, p(a, b) = p(a|b)p(b) can also be computed as log p(a, b) = log p(a|b) + log p(b).
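The identity in the footnote above can be checked numerically; the probabilities below are arbitrary illustrations.

```python
import math

# Check that log p(a, b) = log p(a|b) + log p(b) for illustrative values.
p_b = 0.4           # p(b)
p_a_given_b = 0.25  # p(a|b)
p_ab = p_a_given_b * p_b  # p(a, b) via the product rule

lhs = math.log(p_ab)
rhs = math.log(p_a_given_b) + math.log(p_b)
assert math.isclose(lhs, rhs)  # multiplication became addition in log space
```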
3 Although it could be used to describe more "automatic" action selection processes that do not involve prospective consideration of possible action outcomes (i.e., instead involving a more direct mapping from states to actions). For example, a prediction error between observed and "expected" (i.e., physiologically viable) glucose levels could be set up to directly increase the activity of units predicting that the agent will pick up and eat some food.

4 In more recent literature, this state prediction error, and the associated free energy, has been modified slightly to address some known limitations of variational approximations associated with overconfidence (i.e., implementing a modified scheme called marginal message passing; Parr, Markovic, Kiebel, & Friston, 2019). In brief, this modified scheme assigns reduced weighting to the prior terms (B^π) and adjusts the form of backward messages to better approximate an alternative algorithm called belief propagation, which has superior performance but greater computational cost.

Fig. 4. Simulations of an active inference agent with: (1) a maximally imprecise preference distribution (no preference for some observations over others; Left: "Epistemic Value Only" simulation); (2) a moderately precise preference distribution favoring one observation over others (Middle: "Moderate Preference Precision" simulation); and (3) a highly precise preference for one observation over others (Right: "High Preference Precision" simulation). As shown in the top-left, darker colors indicate higher probabilities and cyan dots indicate the true actions or observations. In this simulation, an agent wakes up in the living room after watching TV and wants to find its way to the bedroom to fall asleep for the night (i.e., in the simulations where it has a non-flat preference distribution). However, the room is dark, and so it is not clear which of the two doors leads to the bedroom. The agent can therefore either choose to stay in the living room, guess at one of the doors (which could delay sleep if it is wrong), or turn on a light first to see which one is the bedroom door. As can be seen, with a flat preference distribution, the agent is information-seeking and turns on the light, but then has no motivation to make one choice over others and picks randomly. With moderate preference precision, the agent turns on the light and then confidently picks the bedroom door. With high preference precision, the agent is risk-seeking and guesses at a door without turning on the light (i.e., a 50-50 chance of getting it right). This illustrates how generative models with different precision values result in different patterns of choice behavior (and could capture such behavioral differences in real participants in empirical studies). These simulations are taken from the generative model described in Smith et al. (2022c).
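The qualitative pattern in these simulations can be reproduced with a toy scoring of the two informative plans, where each plan's (negative) expected free energy is decomposed into information gain plus preference precision times the expected number of preferred ("in bedroom") observations. The two-timestep horizon, the one bit (ln 2) of information gained by turning on the light, and the expected observation counts below are all hypothetical simplifications, not the actual POMDP from Smith et al. (2022c).

```python
import math

def best_plan(pref_precision):
    """Pick between 'light first' and 'guess a door' for a given
    preference precision, scoring each plan as
    information gain + pref_precision * expected preferred observations."""
    # Light first: resolves 1 bit about the doors; bedroom reached at t=2
    # (1 preferred observation over the two-step horizon).
    light_first = math.log(2) + pref_precision * 1.0
    # Guess: no information gain; bedroom at t=1 with prob 0.5, else at t=2,
    # giving 0.5*2 + 0.5*1 = 1.5 expected preferred observations.
    guess_door = 0.0 + pref_precision * 1.5
    return "light" if light_first >= guess_door else "guess"
```

With a flat or moderately precise preference distribution the epistemic term dominates and the agent turns on the light; once the precision exceeds the crossover (here 2·ln 2), the risk-seeking guess wins, matching the three panels described above.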