Optimal Allocation of Finite Sampling Capacity in Accumulator Models of Multialternative Decision Making

Abstract When facing many options, we narrow down our focus to very few of them. Although behaviors like this can be a sign of heuristics, they can actually be optimal under limited cognitive resources. Here, we study the problem of how to optimally allocate limited sampling time to multiple options, modeled as accumulators of noisy evidence, to determine the most profitable one. We show that the effective sampling capacity of an agent increases with both available time and the discriminability of the options, and optimal policies undergo a sharp transition as a function of it. For small capacity, it is best to allocate time evenly to exactly five options and to ignore all the others, regardless of the prior distribution of rewards. For large capacities, the optimal number of sampled accumulators grows sublinearly, closely following a power law as a function of capacity for a wide variety of priors. We find that allocating equal times to the sampled accumulators is better than using uneven time allocations. Our work highlights that multialternative decisions are endowed with breadth–depth tradeoffs, demonstrates how their optimal solutions depend on the amount of limited resources and the variability of the environment, and shows that narrowing down to a handful of options is always optimal for small capacities.

Since the variance is the expectation of a non-negative quantity, we conclude that the posterior expected value of the drift is a monotonically increasing function of the observed accumulated evidence x for any prior.

A.3 Expected value of the drift in the small capacity limit
Here we show that, in the small capacity limit, the utility in Eq. (9) can be written as in Eq. (11) for any regular prior distribution. Our strategy is to study the limiting behaviors of the cumulative density function (described below in Sec. A.4) and the posterior mean of the drift (detailed in this section) that appear in Eq. (9) as $C = \sigma_0^2 T / \sigma^2$ goes to zero. From Bayes' rule, Eq. (3), the posterior mean of the drift is given by
$$\hat{\mu}(x|t, \sigma, \theta) = \frac{\int \mu\, e^{-\frac{1}{2\sigma^2 t}(\mu t - x)^2}\, p(\mu|\theta)\, \mathrm{d}\mu}{\int e^{-\frac{1}{2\sigma^2 t}(\mu t - x)^2}\, p(\mu|\theta)\, \mathrm{d}\mu}\,.$$
Let us focus on the numerator, which we will interpret as the expectation value of $\mu\, e^{-\frac{1}{2\sigma^2 t}(\mu t - x)^2}$ with respect to the prior. We assume the prior to be such that this expectation is finite for all $x$ and that all its moments are finite (e.g., Gaussian and uniform distributions). We define $z \equiv z(x) = \frac{1}{\sigma\sqrt{t}}(x - \mu_0 t)$ and $s \equiv \frac{1}{\sigma\sqrt{t}}(\mu t - \mu_0 t)$, where $\mu_0$ is the prior mean; by adding and subtracting $\mu_0 t$ in the exponent, we can write the numerator in the above equation as
$$\mathbb{E}_\mu\!\left[\mu\, e^{-\frac{1}{2}(s - z)^2}\right] = e^{-z^2/2}\, \mathbb{E}_\mu\!\left[\mu\, e^{zs - s^2/2}\right].$$
Next, we note that the exponential in the expectation is the generating function of the Hermite polynomials, and thus
$$e^{zs - s^2/2} = \sum_{n=0}^{\infty} \mathrm{He}_n(z)\, \frac{s^n}{n!}\,. \tag{A.2}$$
By replacing the exponential with the infinite series, Eq. (A.2), in the above expectation, we obtain
$$\mathbb{E}_\mu\!\left[\mu\, e^{zs - s^2/2}\right] = \sum_{n=0}^{\infty} \mathrm{He}_n(z)\, \frac{\mathbb{E}_\mu[\mu\, s^n]}{n!} = \mu_0 \sum_{n=0}^{\infty} \mathrm{He}_n(z)\, \frac{\mathbb{E}_\mu[s^n]}{n!} + \frac{\sigma}{\sqrt{t}} \sum_{n=0}^{\infty} \mathrm{He}_n(z)\, \frac{\mathbb{E}_\mu[s^{n+1}]}{n!} = \sum_{n=0}^{\infty} \left(\mu_0\, \mathrm{He}_n(z) + \frac{\sigma}{\sqrt{t}}\, n\, \mathrm{He}_{n-1}(z)\right) \frac{\mathbb{E}_\mu[s^n]}{n!}\,,$$
where we have written $\mu = \mu_0 + \sigma s / \sqrt{t}$ and used that all the moments of the prior are finite, so the sum is well defined. Note that to obtain the third expression we have shifted the second index $n+1 \to n$ and used that the term $n\, \mathrm{He}_{n-1}(z)$ is zero for $n = 0$. We now insert the above series into the expression of the utility in Eq. (9) to obtain
$$\hat{U} = M \int \hat{\mu}(x|t,\sigma,\theta)\, F(x|t,\sigma,\theta)^{M-1}\, p(x|t,\sigma,\theta)\, \mathrm{d}x = M \int \hat{\mu}(z)\, F_z(z|t,\sigma,\theta)^{M-1}\, p_z(z|t,\sigma,\theta)\, \mathrm{d}z\,,$$
where in the first expression it is implicit that $z$ depends on $x$, and in the second we have made a linear transformation of variables from $x$ to $z = z(x)$. Because the typical size of $s$ is $\sigma_0 \sqrt{t}/\sigma = \sqrt{C/M}$ for $t = T/M$, the moments $\mathbb{E}_\mu[s^n]$ scale as $(C/M)^{n/2}$. As the integral only involves polynomials in $z$ weighted by the standard normal (and by a cumulative, which is bounded to lie in the range $[0,1]$), all integrals are finite, and thus we can truncate the series at the first leading order, which is order $\sqrt{C}$. It remains to check whether the cumulative density function $F_z(z|t, \sigma, \theta)$ contributes at order $\sqrt{C}$ or larger; we show below in Sec. A.4 that it does not, since $F_z(z|t, \sigma, \theta) = \Phi(z) + O(C)$. With all this, we can approximate the utility up to order $\sqrt{C}$ as
$$\hat{U} \approx \mu_0 + \sigma_0 \sqrt{\frac{C}{M}} \int_{-\infty}^{\infty} z\, M\, \Phi(z)^{M-1}\, \mathcal{N}(z|0,1)\, \mathrm{d}z\,,$$
which is identical to Eq. (11) in the main manuscript.
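The truncation argument above relies on the Hermite generating-function identity, Eq. (A.2). As an illustrative numerical check (the code and function names below are ours, not part of the original analysis), the following Python sketch compares a truncated series against the exponential:

```python
import math

def hermite_he(n, z):
    """Probabilists' Hermite polynomial He_n(z) via the recurrence
    He_{n+1}(z) = z*He_n(z) - n*He_{n-1}(z)."""
    if n == 0:
        return 1.0
    h_prev, h = 1.0, z
    for k in range(1, n):
        h_prev, h = h, z * h - k * h_prev
    return h

def generating_function_partial_sum(z, s, n_terms=25):
    """Partial sum of sum_n He_n(z) s^n / n!, which should converge to
    exp(z*s - s^2/2)."""
    return sum(hermite_he(n, z) * s**n / math.factorial(n) for n in range(n_terms))

z, s = 0.8, 0.4
exact = math.exp(z * s - s**2 / 2)
approx = generating_function_partial_sum(z, s)
print(abs(exact - approx))  # tiny: the truncated series matches the exponential
```

The factorial in the denominator guarantees fast convergence for moderate $s$, which is the regime relevant here since $s = O(\sqrt{C/M})$.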

A.4 Distribution of evidence at small capacity limit
Here, we find an approximation to the marginalized probability distribution of the evidence at small capacity. From Bayes' rule and the law of the unconscious statistician,
$$p(x|t, \sigma, \theta) = \int \mathcal{N}(x|\mu t, \sigma^2 t)\, p(\mu|\theta)\, \mathrm{d}\mu = \frac{1}{\sqrt{2\pi\sigma^2 t}}\, \mathbb{E}_\mu\!\left[e^{-\frac{1}{2\sigma^2 t}(\mu t - x)^2}\right].$$
To compute this expectation, we follow the same procedure as in Sec. A.3. We define $z \equiv \frac{1}{\sigma\sqrt{t}}(x - \mu_0 t)$ and $s \equiv \frac{1}{\sigma\sqrt{t}}(\mu t - \mu_0 t)$, and add and subtract $\mu_0 t$ in the exponent to obtain
$$p(x|t, \sigma, \theta) = \frac{1}{\sqrt{2\pi\sigma^2 t}}\, e^{-z^2/2}\, \mathbb{E}_\mu\!\left[e^{zs - s^2/2}\right].$$
Next, we again identify the exponential generating function of the Hermite polynomials, $e^{zs - s^2/2} = \sum_{n=0}^{\infty} \mathrm{He}_n(z)\, s^n/n!$, and thus we obtain a series for the probability distribution of the evidence,
$$p(x|t, \sigma, \theta) = \frac{1}{\sqrt{2\pi\sigma^2 t}}\, e^{-z^2/2} \sum_{n=0}^{\infty} \mathrm{He}_n(z)\, \frac{\mathbb{E}_\mu[s^n]}{n!}\,.$$
We see that at leading order the distribution of the evidence is a normal distribution, while the order-$\sqrt{C}$ term vanishes because $\mathbb{E}_\mu[s] = 0$ ($\mu_0$ is the prior mean). Therefore, its cumulative in the variable $z = z(x)$ is, exactly up to order $\sqrt{C}$,
$$F_z(z|t, \sigma, \theta) = \Phi(z)\,.$$
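The claim that the evidence is normally distributed at leading order even for a non-Gaussian prior can be probed by simulation. A minimal Monte Carlo sketch (ours; a uniform prior on [0, 1] and a small sampling time stand in for the small-capacity limit):

```python
import random, math

random.seed(1)
sigma, t = 1.0, 0.1          # small time -> small capacity C = sigma0^2 t / sigma^2
mu0 = 0.5                    # mean of the uniform prior on [0, 1]
n = 200_000

# z = (x - mu0*t) / (sigma*sqrt(t)) should be ~ N(0,1) up to O(C) corrections,
# even though the prior over drifts is uniform, not Gaussian.
zs = []
for _ in range(n):
    mu = random.random()                      # drift drawn from the uniform prior
    x = random.gauss(mu * t, sigma * math.sqrt(t))   # noisy accumulated evidence
    zs.append((x - mu0 * t) / (sigma * math.sqrt(t)))

frac_below_1 = sum(z < 1.0 for z in zs) / n
phi_1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))   # standard normal CDF at z = 1
print(abs(frac_below_1 - phi_1))  # small: the cumulative is Phi(z) + O(C)
```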

A.5 Asymptotic limit of relevant integral
In this subsection we obtain the asymptotic limit, $M \to \infty$, of the integral appearing in Eqs. (11) and (12),
$$I(M) = \int_{-\infty}^{\infty} y\, M\, \Phi(y)^{M-1}\, \mathcal{N}(y|0,1)\, \mathrm{d}y\,,$$
where $\Phi(y)$ is the normal cumulative distribution function. This integral is the expected value of the maximum of $M$ independent standard normal variables, whose cumulative distribution is $\Phi(y)^M$. Using Extreme Value Theory [1], it can be shown that the normal distribution belongs to the Gumbel class of the generalized extreme value distributions, that is,
$$\Phi(a_M + b_M\, y)^M \;\to\; \exp\!\left(-e^{-y}\right) \quad \text{as } M \to \infty\,, \qquad b_M = \frac{1}{\sqrt{2 \ln M}}\,, \qquad a_M = \sqrt{2 \ln M} - \frac{\ln \ln M + \ln 4\pi}{2\sqrt{2 \ln M}}\,.$$
Using this result, the integral follows from the mean of the Gumbel distribution,
$$I(M) \approx a_M + \gamma\, b_M \;\to\; \sqrt{2 \ln M} \quad \text{as } M \to \infty\,,$$
where $\gamma \approx 0.577$ is the Euler–Mascheroni constant.
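The quality of the Gumbel approximation can be probed by Monte Carlo. The sketch below (ours) assumes the standard EVT normalization constants $a_M$ and $b_M$ for the normal distribution:

```python
import random, math

random.seed(2)
M = 1000                 # number of i.i.d. standard normals (accumulators)
trials = 2000

# Monte Carlo estimate of I(M) = E[max of M standard normals].
mc = sum(max(random.gauss(0, 1) for _ in range(M)) for _ in range(trials)) / trials

# Gumbel-class approximation: E[max] ~ a_M + gamma * b_M, with the usual
# normalization constants for the normal domain of attraction.
gamma = 0.5772156649
bM = 1 / math.sqrt(2 * math.log(M))
aM = math.sqrt(2 * math.log(M)) - (math.log(math.log(M)) + math.log(4 * math.pi)) * bM / 2
approx = aM + gamma * bM

print(mc, approx)  # both near 3.2-3.3 for M = 1000
```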

A.6 Expected utility for uniform prior
For this choice of prior, drifts are all drawn independently and identically from a uniform probability distribution between zero and one. That is,
$$p(\mu) = \Theta(\mu)\, \Theta(1 - \mu)\,,$$
where $\Theta(x)$ is the Heaviside step function. We can substitute this prior into Eq. (3) to obtain the posterior probability distribution for the drifts,
$$p(\mu|x, t, \sigma) = \frac{e^{-\frac{1}{2\sigma^2 t}(\mu t - x)^2}}{\int_0^1 e^{-\frac{1}{2\sigma^2 t}(\nu t - x)^2}\, \mathrm{d}\nu} \quad \text{for } 0 \le \mu \le 1\,, \quad \text{and } 0 \text{ otherwise.}$$
This yields an expectation value for each drift,
$$\hat{\mu}(x_i|t_i, \sigma) = \frac{\int_0^1 \mu\, e^{-\frac{1}{2\sigma^2 t_i}(\mu t_i - x_i)^2}\, \mathrm{d}\mu}{\int_0^1 e^{-\frac{1}{2\sigma^2 t_i}(\mu t_i - x_i)^2}\, \mathrm{d}\mu}\,,$$
where the denominator is related to the probability distribution of the evidence $x_i$, which we can find by marginalizing over drifts,
$$p(x|t, \sigma) = \int_0^1 \mathcal{N}(x|\mu t, \sigma^2 t)\, \mathrm{d}\mu = \frac{1}{t}\left[\Phi\!\left(\frac{x}{\sigma\sqrt{t}}\right) + \Phi\!\left(\frac{t - x}{\sigma\sqrt{t}}\right) - 1\right]. \tag{A.5}$$
From now on we use the assumption of even time allocation, $t_i = t = T/M$ for all $i$. The cumulative probability distribution for the evidence in Eq. (A.5) is, integrating by parts,
$$F(x|t, \sigma) = \frac{\sigma}{\sqrt{t}}\left[G\!\left(\frac{x}{\sigma\sqrt{t}}\right) - G\!\left(\frac{x - t}{\sigma\sqrt{t}}\right)\right], \qquad G(u) \equiv u\, \Phi(u) + \mathcal{N}(u|0,1)\,,$$
where in the last equality we have rewritten the solution in a convenient form. Hence, the product of the expected value with the probability density can be rewritten in terms of the cumulative function, from the previous equation, as
$$\hat{\mu}(x|t, \sigma)\, p(x|t, \sigma) = \frac{1}{t}\left[F(x|t, \sigma) - \Phi\!\left(\frac{x - t}{\sigma\sqrt{t}}\right)\right],$$
and using Eq. (9) we get the expression for the utility,
$$\hat{U} = \frac{M}{t} \int_{-\infty}^{\infty} \left[F(x|t, \sigma) - \Phi\!\left(\frac{x - t}{\sigma\sqrt{t}}\right)\right] F(x|t, \sigma)^{M-1}\, \mathrm{d}x\,.$$
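The posterior mean under the uniform prior is the mean of a normal with location $x/t$ and scale $\sigma/\sqrt{t}$ truncated to $[0, 1]$. The following Python sketch (illustrative; function names are ours) checks the closed form against direct numerical integration of Bayes' rule:

```python
import math

def phi(u):  # standard normal pdf
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def Phi(u):  # standard normal cdf
    return 0.5 * (1 + math.erf(u / math.sqrt(2)))

def posterior_mean_uniform(x, t, sigma):
    """Posterior mean of the drift under a uniform prior on [0, 1]:
    the mean of a normal with location x/t and scale sigma/sqrt(t),
    truncated to [0, 1]."""
    m, s = x / t, sigma / math.sqrt(t)
    a, b = (0 - m) / s, (1 - m) / s
    return m + s * (phi(a) - phi(b)) / (Phi(b) - Phi(a))

def posterior_mean_numeric(x, t, sigma, n=20_000):
    """Direct midpoint-rule integration of Bayes' rule over mu in [0, 1]."""
    num = den = 0.0
    for k in range(n):
        mu = (k + 0.5) / n
        w = math.exp(-(mu * t - x) ** 2 / (2 * sigma**2 * t))
        num += mu * w
        den += w
    return num / den

x, t, sigma = 1.4, 2.0, 1.0
print(posterior_mean_uniform(x, t, sigma))
print(posterior_mean_numeric(x, t, sigma))  # the two should agree closely
```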

A.7 Expected utility for Gaussian bimodal prior
The expected utility for the bimodal Gaussian prior with modes $\mu_1$ and $\mu_2$, each with variance $\sigma_0^2$, is quite similar to the unimodal case, Eq. (10), and follows from a straightforward application of Eq. (9). The probability distribution of the evidence marginalized over drifts is $p(x|t, \sigma, \theta) = \frac{1}{2}\mathcal{N}(x|\mu_1 t, \sigma^2 t + \sigma_0^2 t^2) + \frac{1}{2}\mathcal{N}(x|\mu_2 t, \sigma^2 t + \sigma_0^2 t^2)$. Therefore the cumulative is
$$F(x|t, \sigma, \theta) = \frac{1}{2}\Phi(x|\mu_1 t, \sigma^2 t + \sigma_0^2 t^2) + \frac{1}{2}\Phi(x|\mu_2 t, \sigma^2 t + \sigma_0^2 t^2)\,,$$
where $\Phi(x|\mu_m, \sigma_m^2)$ is the normal cumulative distribution for one mode. However, the expected value of the drift is a bit more involved, since the posterior distribution over drifts takes the form of a Gaussian mixture,
$$p(\mu|x, t, \sigma, \theta) = w_1(x)\, \mathcal{N}(\mu|m_1(x), v) + w_2(x)\, \mathcal{N}(\mu|m_2(x), v)\,,$$
with posterior weights $w_m(x) \propto \mathcal{N}(x|\mu_m t, \sigma^2 t + \sigma_0^2 t^2)$ (normalized so that $w_1 + w_2 = 1$), per-mode means $m_m(x) = \frac{\mu_m \sigma^2 + x\, \sigma_0^2}{\sigma^2 + \sigma_0^2 t}$, and common variance $v = \frac{\sigma_0^2 \sigma^2}{\sigma^2 + \sigma_0^2 t}$. Consequently, the expected value is
$$\hat{\mu}(x|t, \sigma, \theta) = w_1(x)\, m_1(x) + w_2(x)\, m_2(x)\,.$$
Then, the expected utility is
$$\hat{U} = M \int_{-\infty}^{\infty} \hat{\mu}(x|t, \sigma, \theta)\, F(x|t, \sigma, \theta)^{M-1}\, p(x|t, \sigma, \theta)\, \mathrm{d}x\,.$$
This expression is numerically integrated and used in Fig. 4.
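The posterior-mean computation above amounts to a responsibility-weighted mixture of two conjugate-Gaussian posterior means. An illustrative Python sketch (ours; parameter values are arbitrary) checks this against brute-force numerical integration:

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_mean_bimodal(x, t, sigma, sigma0, mu1, mu2):
    """Posterior mean under an equal-weight bimodal Gaussian prior:
    a responsibility-weighted mix of the per-mode posterior means."""
    # Per-mode evidence: x | mode m ~ N(mu_m t, sigma^2 t + sigma0^2 t^2)
    v = sigma**2 * t + sigma0**2 * t**2
    w1 = normal_pdf(x, mu1 * t, v)
    w2 = normal_pdf(x, mu2 * t, v)
    # Per-mode conjugate-Gaussian posterior means
    m1 = (mu1 * sigma**2 + x * sigma0**2) / (sigma**2 + sigma0**2 * t)
    m2 = (mu2 * sigma**2 + x * sigma0**2) / (sigma**2 + sigma0**2 * t)
    return (w1 * m1 + w2 * m2) / (w1 + w2)

def posterior_mean_numeric(x, t, sigma, sigma0, mu1, mu2, lo=-10.0, hi=10.0, n=40_000):
    """Midpoint-rule integration of Bayes' rule over a wide drift range."""
    num = den = 0.0
    for k in range(n):
        mu = lo + (hi - lo) * (k + 0.5) / n
        prior = 0.5 * normal_pdf(mu, mu1, sigma0**2) + 0.5 * normal_pdf(mu, mu2, sigma0**2)
        like = math.exp(-(mu * t - x) ** 2 / (2 * sigma**2 * t))
        num += mu * like * prior
        den += like * prior
    return num / den

x, t, sigma, sigma0, mu1, mu2 = 0.8, 2.0, 1.0, 0.5, 0.0, 1.0
print(posterior_mean_bimodal(x, t, sigma, sigma0, mu1, mu2))
print(posterior_mean_numeric(x, t, sigma, sigma0, mu1, mu2))  # should match
```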

A.8 Stochastic gradient ascent method for Gaussian prior
To maximize utility, Eq. (14) in the main manuscript, under the time constraint, we can make use of unconstrained optimization through Lagrange multipliers. We construct the Lagrangian
$$\mathcal{L}(\mathbf{t}, \lambda, \boldsymbol{\mu}) = \hat{U}(\mathbf{t}|\theta) - \lambda\, h(\mathbf{t}) + \sum_{i=1}^{M} \mu_i\, g_i(\mathbf{t})\,, \tag{A.8}$$
where $h(\mathbf{t}) = \sum_{i=1}^{M} t_i - T$ is the equality constraint that defines the hyperplane and $0 \le g_i(\mathbf{t}) = t_i$ is the inequality constraint forcing all times $t_i$ to be non-negative and thus defining the simplex. The quantities $\lambda$ and $\mu_i$ are the Lagrange multipliers. In other words, maximizing utility, Eq. (14), subject to the initial constraints can be done by optimizing the Lagrangian, Eq. (A.8), with respect to $\mathbf{t}$, $\lambda$ and $\boldsymbol{\mu}$ subject to the Karush-Kuhn-Tucker conditions [2]: primal feasibility, $h(\mathbf{t}) = 0$ and $g_i(\mathbf{t}) \ge 0$; dual feasibility, $\mu_i \ge 0$; and complementary slackness, $\mu_i\, g_i(\mathbf{t}) = 0$ for all $i$. We notice that, since $g_i(\mathbf{t}) = t_i$, the complementary slackness condition can be rewritten as $\mu_i t_i = 0$ for all $i$. Next, we detail the gradient ascent method used to obtain Fig. 5e. As explained above, we want to optimize utility, Eq. (14), subject to a set of equality, Eq. (1), and inequality constraints, $t_i \ge 0$, as described in the section "Even allocation is optimal" of the Results. As all our constraints are linear, we can make use of the gradient projection method [3]. In this case, we want to obtain the gradient of the utility, Eq. (14), and project it onto the $(M-1)$-simplex such that the capacity constraint in Eq. (1) is satisfied. Due to the linear capacity equality constraint, this projection is simply given by the linear operator
$$P = \mathrm{Id}_{M \times M} - \frac{1}{M}\, \mathbf{1}_{M \times M}\,,$$
where $\mathrm{Id}_{M \times M}$ is the $M \times M$ identity matrix and $\mathbf{1}_{M \times M}$ is an $M \times M$ matrix full of ones. Therefore, we can maximize utility by updating $\mathbf{t}^{(k)}$ as
$$\mathbf{t}^{(k+1)} = \mathbf{t}^{(k)} + \eta\, P\, \nabla_{\mathbf{t}} \hat{U}(\mathbf{t}^{(k)}|\theta)\,, \tag{A.11}$$
where $\eta = 10^{-1}\, T$ is the default step size, $k$ is the iteration number, and $\theta$ corresponds to the parameters of the Gaussian prior. The utility for an arbitrary time allocation $\mathbf{t}$ for the Gaussian prior case is, using Eq. (14),
$$\hat{U}(\mathbf{t}|\theta) = \sum_{i=1}^{M} \int_{-\infty}^{\infty} \nu\, \mathcal{N}(\nu|\mu_0, s_i^2) \prod_{j \ne i} \Phi\!\left(\frac{\nu - \mu_0}{s_j}\right) \mathrm{d}\nu\,, \qquad s_i^2 = \sigma_0^2\, \frac{\sigma_0^2 t_i / \sigma^2}{1 + \sigma_0^2 t_i / \sigma^2}\,,$$
where $s_i^2$ is the variance of the posterior mean of accumulator $i$. We can therefore compute the derivative of this expression with respect to all components $t_i$ and numerically integrate the result.
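The key property of the projection operator is that the projected gradient sums to zero, so each update in Eq. (A.11) stays on the hyperplane $\sum_i t_i = T$. A minimal Python sketch (ours; the gradient used here is a toy stand-in for the gradient of the utility):

```python
def project_gradient(grad):
    """Apply P = Id - (1/M) * ones to a gradient: subtracting the mean
    makes the projected vector sum to zero."""
    M = len(grad)
    g_mean = sum(grad) / M
    return [g - g_mean for g in grad]

def ascent_step(t, grad, eta):
    """One projected gradient-ascent update, as in Eq. (A.11)."""
    p = project_gradient(grad)
    return [ti + eta * pi for ti, pi in zip(t, p)]

# Toy example with a made-up gradient (the paper computes the gradient of
# the Gaussian-prior utility by numerical integration).
T = 1.0
t = [0.4, 0.3, 0.2, 0.1]
grad = [1.0, -0.5, 0.2, 0.3]
t_new = ascent_step(t, grad, eta=0.1 * T)
print(sum(t_new))  # still equals T = 1.0 (up to floating point)
```

Because the projected step preserves the total time exactly, only the non-negativity constraints need separate handling, which the active-set procedure below provides.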
In addition to the linear capacity constraint, we have to enforce the inequality constraints as well, i.e. $t_i \ge 0$, which we do by utilizing an active set of constraints. To implement it, we start in a relatively high-dimensional $(M-1)$-simplex, choosing $M$ to be $2M^*$, where $M^*$ is the optimal number of accumulators to sample in the even-sampling case (which is estimated beforehand through exploration, see main text). Whenever any component of $\mathbf{t}^{(k+1)}$ derived from Eq. (A.11) approaches a border ($t_i^{(k+1)} \approx \tau$ for some $i$ and small $\tau$), the step size is decreased until the component effectively reaches zero. In that case, this dimension is added to the active constraint set (we inactivate the dimension), thus downgrading the simplex to a lower dimension. In this way, our algorithm only reduces the initial dimension of the simplex and never extends it. To initially activate all $2M^*$ dimensions, for any random initial condition $\mathbf{t}^{(0)}$ we make sure that all components are greater than the threshold, $t_i^{(0)} > \tau$ for all $i = 1, \ldots, 2M^*$.
Finally, in order to avoid getting trapped in local maxima, we add noise at every iteration as follows. At every step $k$ of Eq. (A.11), and with probability 0.1, we push the component $t_i^{(k)}$ of a randomly chosen dimension $i$ by a magnitude $\delta = 10^{-3}\, T$ and pull the component $t_j^{(k)}$ of another randomly chosen dimension $j$ by the same amount in the opposite direction, so as to stay in the appropriate simplex.