Neurobiological successor features for spatial navigation

Abstract The hippocampus has long been observed to encode a representation of an animal's position in space. Recent evidence suggests that the nature of this representation is somewhat predictive and can be modeled by learning a successor representation (SR) between distinct positions in an environment. However, this discretization of space is subjective, making it difficult to formulate predictions about how some environmental manipulations should impact the hippocampal representation. Here, we present a model of place and grid cell firing as a consequence of learning an SR from a basis set of known neurobiological features: boundary vector cells (BVCs). The model describes place cell firing as the successor features of the SR, with grid cells forming a low‐dimensional representation of these successor features. We show that the place and grid cells generated using the BVC‐SR model provide a good account of biological data for a variety of environmental manipulations, including dimensional stretches, barrier insertions, and the influence of environmental geometry on the hippocampal representation of space.

One way to approach this problem is through reinforcement learning (RL). RL (Sutton & Barto, 2018) seeks to address how an agent should act optimally to maximize expected future reward. Consequently, a quantity often used in RL is the value $V$ of a state $s$ in the environment, which is defined as the expected cumulative reward $R$, exponentially discounted into the future by a discount parameter $\gamma \in [0, 1]$:

$$V(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\middle|\, s_0 = s\right] \tag{1}$$
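This equation can be rewritten by deconstructing value into the long-run transition statistics and corresponding reward statistics of the environment (Dayan, 1993):

$$V(s) = \sum_{s'} M(s, s')\, R(s') \tag{2}$$

Here, the transition statistics, denoted by $M$, are called the successor representation (SR), which represents the discounted expected future occupancy of each state $s'$ from the current state $s$:

$$M(s, s') = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{I}(s_t = s') \,\middle|\, s_0 = s\right] \tag{3}$$

where $\mathbb{I}(s_t = s')$ equals 1 when $s_t = s'$ and 0 otherwise.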
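For a fixed policy with a known one-step transition matrix, the SR has a closed form, $M = (I - \gamma T)^{-1}$ for $\gamma < 1$, so the decomposition of value into $M$ and $R$ can be checked numerically. The sketch below does this in Python with NumPy for a hypothetical three-state ring. The transition matrix `T`, the reward vector `R`, and the function name `successor_representation` are illustrative assumptions and are not taken from the model.

```python
import numpy as np

def successor_representation(T, gamma):
    """Closed-form SR for a fixed policy: M = sum_t gamma^t T^t = (I - gamma T)^-1."""
    n = T.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * T)

# Hypothetical 3-state ring: state 0 -> 1 -> 2 -> 0, deterministically.
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
gamma = 0.9
R = np.array([0.0, 0.0, 1.0])  # reward available only in state 2

M = successor_representation(T, gamma)
V = M @ R  # value as discounted future occupancy weighted by reward
```

Each row of `M` sums to $1/(1-\gamma)$, the total discounted occupancy, and `V` satisfies the Bellman equation $V = R + \gamma T V$. Changing `R` alone changes `V` without recomputing `M`, which is the flexibility the SR is meant to provide.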
The SR $M$ encapsulates both the short- and long-term state-transition dynamics of the environment, with a time horizon dictated by the discount parameter $\gamma$. Furthermore, changes to the transition and reward structure can be incorporated into the value estimates $V(s)$ by adjusting $M$ and $R$, respectively. These adjustments can be made experientially using a temporal-difference learning rule, which uses the difference between predicted and actual outcomes to improve the accuracy of the predicted estimate (Sutton, 1988 […]

[…]igation that the brain does not represent space as a grid of discrete states, but rather uses an array of spatially sensitive neurons. In particular, boundary-responsive neurons are found throughout the hippocampal formation, including "border cells" in superficial medial entorhinal cortex (mEC) (Solstad et al., 2008) and BVCs in subiculum (Barry et al., 2006; Hartley, Burgess, Lever, Cacucci, & Keefe, 2000; Lever et al., 2009). Because these neurons effectively provide a representation of the environmental topography surrounding the animal and, in the case of the mEC, are positioned to provide input to the main hippocampal subfields (Zhang et al., 2014), it seems plausible that they might function as an efficient substrate for an SR.
Thus, the aim of this article is to build and evaluate a biologically plausible SR based on the firing rates of known neurobiological features in the form of BVCs (Barry et al., 2006; Hartley et al., 2000; Lever et al., 2009; Solstad et al., 2008). Not only does this provide an efficient foundation for solving goal-directed spatial navigation problems, but we also show that it explains electrophysiological phenomena currently unaccounted for by the standard SR model (Stachenfeld et al., 2017).

MODEL
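We generate a population of BVCs following the specification used in previous iterations of the BVC model (Barry & Burgess, 2007; Grieves, Duvelle, & Dudchenko, 2018; Hartley et al., 2000). That is, the firing of the $i$th BVC, tuned to preferred distance $d_i$ and angle $\phi_i$, to a boundary at distance $r$ and direction $\theta$ subtending an angle $\delta\theta$ is given by:

$$\delta f_i = g_i(r, \theta)\, \delta\theta \tag{4}$$

where

$$g_i(r, \theta) \propto \frac{\exp\left[-(r - d_i)^2 / 2\sigma_{\mathrm{rad}}^2(d_i)\right]}{\sqrt{2\pi\sigma_{\mathrm{rad}}^2(d_i)}} \times \frac{\exp\left[-(\theta - \phi_i)^2 / 2\sigma_{\mathrm{ang}}^2\right]}{\sqrt{2\pi\sigma_{\mathrm{ang}}^2}}$$

In the model, the angular tuning width $\sigma_{\mathrm{ang}}$ is constant and the radial tuning width increases linearly with the preferred tuning distance: $\sigma_{\mathrm{rad}}(d_i) = d_i/\beta + \xi$ for constants $\beta$ and $\xi$.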
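The BVC tuning described above can be sketched in a few lines of Python. This is a minimal illustration that assumes Gaussian tuning in both distance and direction, as in the BVC model of Hartley et al. (2000). The function names, the wrapping of the angular difference, and all parameter values (`beta`, `xi`, `sigma_ang`, and the circular enclosure) are placeholders chosen for the example, not the values used in the simulations.

```python
import numpy as np

def sigma_rad(d_i, beta, xi):
    """Radial tuning width grows linearly with preferred distance."""
    return d_i / beta + xi

def bvc_tuning(r, theta, d_i, phi_i, beta, xi, sigma_ang):
    """Contribution of a boundary element at distance r and direction theta."""
    s_rad = sigma_rad(d_i, beta, xi)
    # Wrap the angular difference to (-pi, pi] (an assumption of this sketch).
    dtheta = np.angle(np.exp(1j * (theta - phi_i)))
    radial = np.exp(-(r - d_i) ** 2 / (2 * s_rad ** 2)) / np.sqrt(2 * np.pi * s_rad ** 2)
    angular = np.exp(-dtheta ** 2 / (2 * sigma_ang ** 2)) / np.sqrt(2 * np.pi * sigma_ang ** 2)
    return radial * angular

def bvc_firing(r, theta, d_i, phi_i, beta, xi, sigma_ang):
    """Sum contributions over boundary elements sampled at equally spaced directions."""
    delta_theta = 2.0 * np.pi / len(theta)
    return np.sum(bvc_tuning(r, theta, d_i, phi_i, beta, xi, sigma_ang)) * delta_theta

# Hypothetical example: agent at the centre of a circular enclosure of radius 0.5.
theta = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
r = np.full_like(theta, 0.5)
rate = bvc_firing(r, theta, d_i=0.5, phi_i=0.0, beta=12.0, xi=0.08, sigma_ang=0.2)
```

Evaluating `bvc_firing` for every BVC at every sampled position gives the feature vector $f(s)$ used in the rest of the model.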
Using a set of $n$ BVCs, each position or state $s$ in the environment corresponds to a vector of BVC firing rates $f(s) = [f_1(s), f_2(s), \ldots, f_n(s)]$ (Figure 1). We use a tilde to indicate variables constructed in the BVC feature space of $f$. By learning an SR $\tilde{M}$ among these BVC features, we can use linear function approximation of the value function to learn a set of weights $\tilde{R} = [\tilde{R}_1, \tilde{R}_2, \ldots, \tilde{R}_n]$ such that:

$$V(s) = \tilde{\psi}(s)^{\top} \tilde{R} \tag{5}$$

where $\top$ denotes the transpose and $\tilde{\psi}(s) = \tilde{M} f(s)$ is the vector of successor features constructed using the BVCs as basis features. Analogous to the discrete state-space case, where the successor matrix $M$ provides a predictive mapping from the current state to the expected future states, the successor matrix $\tilde{M}$ provides a predictive mapping from current BVC firing rates $f(s)$ to expected future BVC firing rates.
Importantly, $\tilde{M}$ and $\tilde{R}$ can be learnt online using temporal-difference learning rules (Equations [6] and [7]), where $\alpha_{\tilde{M}}$ and $\alpha_{\tilde{R}}$ are the learning rates for the SR $\tilde{M}$ and weight vector $\tilde{R}$, respectively. Because Equation (6) is independent of reward $R_t$, the model is still able to capture the structure of the environment in the absence of reward ($R = 0$) by learning the successor matrix $\tilde{M}$. In this manner, it inherently accounts for spatial latent learning as described in rodents (Tolman, 1948).
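To make the online learning concrete, the sketch below implements one standard form of linear temporal-difference update for successor features and reward weights. The exact update rules are an assumption here: the code uses the common form in which the successor matrix is driven by the feature prediction error $f(s_t) + \gamma \tilde{M} f(s_{t+1}) - \tilde{M} f(s_t)$ and the reward weights by the error $R_t - \tilde{R}^{\top} f(s_t)$. The function name `td_update`, the learning rates, and the toy three-state ring with one-hot features standing in for BVC firing rates are all hypothetical.

```python
import numpy as np

def td_update(M, R_w, f_t, f_next, reward, gamma, alpha_M, alpha_R):
    """One online update of the successor matrix M and reward weights R_w.

    Assumed standard linear TD form, not necessarily the paper's exact rule.
    """
    # Feature prediction error: contains no reward term, so M can be
    # learnt even when the reward is always zero (latent learning).
    delta_M = f_t + gamma * (M @ f_next) - M @ f_t
    M = M + alpha_M * np.outer(delta_M, f_t)
    # Reward prediction error for the linear reward weights.
    delta_R = reward - R_w @ f_t
    R_w = R_w + alpha_R * delta_R * f_t
    return M, R_w

# Toy check: one-hot features on a deterministic ring 0 -> 1 -> 2 -> 0.
n, gamma = 3, 0.9
features = np.eye(n)                 # stand-in for BVC firing rate vectors
rewards = np.array([0.0, 0.0, 1.0])  # reward received in state 2
M = np.eye(n)
R_w = np.zeros(n)
s = 0
for _ in range(6000):
    s_next = (s + 1) % n
    M, R_w = td_update(M, R_w, features[s], features[s_next],
                       rewards[s], gamma, alpha_M=0.1, alpha_R=0.1)
    s = s_next

def value(state):
    """Value from successor features: V(s) = (M f(s))^T R_w."""
    psi = M @ features[state]
    return psi @ R_w
```

With one-hot features, each column of the learned `M` is the successor feature vector of one state, so `M` should approach the transpose of the closed-form tabular SR, and `value` should approach the tabular value function.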
Consequently, we can learn through experience which BVCs are predictive of others by estimating the SR matrix $\tilde{M}$. More precisely, given the agent is at position $s$ with BVC population firing rate vector […] (Equation [6]).

Similar to the BVC model (Hartley et al., 2000), the firing of each simulated place cell $F_i$ in a given location $s$ is proportional to the thresholded, weighted sum of the BVCs connected to it:

$$F_i(s) \propto \Theta\left(\sum_{j} \tilde{M}_{ij} f_j(s) - T\right) \tag{8}$$

where $T$ is the cell's threshold and

$$\Theta(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$

The weights in the sum (Equation [8]) correspond to a row of the SR matrix $\tilde{M}$ and reflect the individual contributions to the prediction that a particular BVC (encoded by that row) will fire in the near future. Thus, assuming homogeneous behavior, sets of BVCs with overlapping fields will typically exhibit mutually strong positive weights, resulting in the formation of place fields at their intersection (Figure 2a). The place cell threshold $T$ was set to 80% of the cell's maximum activation.
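The place cell readout can be illustrated with a short Python sketch: each simulated place cell takes one row of the successor matrix as weights on the BVC rate maps, and the result is thresholded at 80% of that cell's maximum activation. The function name `place_cell_maps`, the one-dimensional track, the Gaussian stand-ins for BVC rate maps, and the hand-built matrix `M` are hypothetical and only serve to show the computation.

```python
import numpy as np

def place_cell_maps(M, bvc_maps, threshold_frac=0.8):
    """Thresholded successor features.

    M        : (n_bvc, n_bvc) successor matrix
    bvc_maps : (n_bvc, n_positions) BVC firing rates at each sampled position
    Returns  : (n_bvc, n_positions) place cell rate maps
    """
    psi = M @ bvc_maps  # row i holds sum_j M_ij f_j(s) for every position s
    T = threshold_frac * psi.max(axis=1, keepdims=True)  # per-cell threshold
    return np.maximum(psi - T, 0.0)  # rectify: only activity above T remains

# Hypothetical 1D example with three Gaussian features standing in for BVCs.
positions = np.linspace(0.0, 1.0, 101)
centres = np.array([0.25, 0.5, 0.75])
bvc_maps = np.exp(-(positions[None, :] - centres[:, None]) ** 2 / (2 * 0.1 ** 2))
M = np.eye(3) + 0.3 * (np.ones((3, 3)) - np.eye(3))  # illustrative weights
fields = place_cell_maps(M, bvc_maps)
```

Because the threshold is defined relative to each cell's own maximum, every row of `fields` is non-zero only in a compact region around that cell's peak.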
Grid cells in the model are generated by taking the eigendecomposition of the SR matrix $\tilde{M}$ and thus represent a low-dimensional embedding of the SR. Similar to the place cells, the activity of each simulated grid cell $G_i$ is proportional to a thresholded, weighted sum of BVCs. However, for the grid cells, the weights in the sum correspond to a particular eigenvector $\tilde{v}_i$ of the SR matrix $\tilde{M}$, and the firing is thresholded at zero to permit only positive grid cell firing rates.
This gives rise to spatially periodic firing fields such as those observed in Figure 2b.
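The grid cell readout can be sketched in the same style. The code below takes the eigendecomposition of a successor matrix, uses the leading eigenvectors as weights on the feature maps, and rectifies the result at zero. Several choices are assumptions of this sketch: ordering eigenvectors by descending eigenvalue, taking real parts (a learnt successor matrix need not be symmetric), and leaving the arbitrary sign of each eigenvector as returned by the solver. The toy environment is a hypothetical 20-state linear track with a reflecting random walk and one-hot features in place of BVCs.

```python
import numpy as np

def grid_cell_maps(M, bvc_maps, n_cells):
    """Rectified projections of the feature maps onto eigenvectors of M.

    Returns the rate maps, the eigenvectors used (as columns), and their eigenvalues.
    """
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(-eigvals.real)[:n_cells]  # largest eigenvalues first
    vals = eigvals.real[order]
    vecs = eigvecs.real[:, order]
    maps = np.maximum(vecs.T @ bvc_maps, 0.0)    # threshold at zero
    return maps, vecs, vals

# Toy linear track: unbiased random walk that stays put when it hits an end wall.
n, gamma = 20, 0.9
T = 0.5 * (np.eye(n, k=1) + np.eye(n, k=-1))
T[0, 0] = T[-1, -1] = 0.5
M = np.linalg.inv(np.eye(n) - gamma * T)         # closed-form SR for this walk
maps, vecs, vals = grid_cell_maps(M, np.eye(n), n_cells=4)
```

On this track the eigenvectors are cosine-like functions of position, so the rectified maps show fields at regular spacing, a one-dimensional analogue of the periodic firing described above.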

RESULTS
Following Stachenfeld and colleagues (Stachenfeld et al., 2017) […] (Stachenfeld et al., 2017). In contrast, BVC-SR derived place fields, like real place cells and those from the BVC model (Hartley et al., 2000; Muller, Kubie, & Ranck, 1987) […]

Empirical work has shown that grid-patterns are modulated by environmental geometry, with the regular spatial activity becoming distorted in strongly polarized environments (Derdikman et al., 2009; Krupic et al., 2015; Stensola, Stensola, Moser, & Moser, 2015). Grid-patterns derived from the standard-SR eigenvectors also exhibit distortions comparable to those seen experimentally. Thus, we next examined the regularity of BVC-SR eigenvectors derived from SR matrices trained in square and trapezoid environments. As with rodent data (Krupic et al., 2015) and the standard-SR model, we found that grid-patterns in the two halves of the square environment were considerably more regular than those derived from the trapezoid (mean correlation between spatial autocorrelograms ± SD: 0.68 ± 0.18 vs. 0.47 ± 0.15, t[318] = 10.99, p < 0.001; Figure 5b). Furthermore, BVC-SR eigenvectors that exceeded a shuffled gridness threshold (see supplementary methods), and hence were classified as grid cells, were more regular in the square than the trapezoid (mean gridness ± SD: 0.37 ± 0.17 vs. 0.10 ± 0.09; t[24] = 4.87, p < 0.001; Figure 5c). In particular, as had previously been noted in rodents (Krupic et al., 2015), the regularity of these "grid cells" was markedly […]

FIGURE 3 BVC-SR derived place cells deform in response to geometric manipulations made to the environment. (a) Scaling one or both axes of an environment produces commensurate changes in the activity of BVC-SR place cells. (b, c) Firing field size scales proportionally with environment size, (d) whereas the relative size of place fields is largely preserved between environments (Pearson correlation coefficient shown).

Rodent grid-patterns have been shown to orient relative to straight environmental boundaries, tending to align to the walls of square but not circular environments (Krupic et al., 2015; Stensola et al., 2015). In a similar vein, we saw that firing patterns of simulated grid cells were also more polarized in a square than a circular environment, tending to clus- […]

DISCUSSION
The model presented here links the BVC model of place cell firing with a SR to provide an efficient platform for using RL to navigate space. The work builds upon previous implementations of the SR by replacing the underlying grid of states with the firing rates of known neurobiological features-BVCs, which have been observed in the hippocampal formation (Barry et al., 2006;Lever et al., 2009;Solstad et al., 2008) and can be derived from optic flow (Raudies & Hasselmo, 2012). As a consequence, the place cells generated using the BVC-SR approach presented here produce more realistic fields that conform to the shape of the environment. Unlike previous SR implementations, the BVC-SR place fields respond immediately to environmental manipulations such as dimensional stretches and barrier insertions in a similar manner to real place cells.
Comparable to previous SR implementations, the eigenvectors of the SR matrix $\tilde{M}$ display grid cell-like periodicity when projected back onto the BVC state space, with reduced periodicity in polarized enclosures such as trapezoids. Furthermore, likely due to the experiential learning and the natural smoothness of the BVC basis features, the eigenvectors from the BVC-SR model exhibit more realistic variations among grid fields, resulting in a model of grid cells that is more similar to biological recordings than previous implementations of the SR. This form of eigendecomposition is similar to other dimensionality reduction techniques that have been used to generate grid cells from populations of idealized place cells with a generalized Hebbian learning rule (Dordek, Soudry, Meir, & Derdikman, 2016; Oja, 1982).

FIGURE 4 Insertion of an additional barrier into an environment can induce duplication of BVC-SR place fields. (a) In 23% of place cells, barrier insertion causes immediate place field duplication. In most cases (81%), the duplicate field persists for the equivalent of 40 min of random foraging (learning update occurs at 50 Hz). (b) In some cases (19%), one of the duplicate fields, not necessarily the new one, is lost during subsequent exploration. Similar results have been observed in vivo (Barry et al., 2006).
Previously, low-dimensional encodings such as these have been shown to accelerate learning and facilitate vector-based navigation (Banino et al., 2018;Gustafson & Daw, 2011).
The model extends the BVC model of place cell firing (Barry et al., 2006; Barry & Burgess, 2007; Hartley et al., 2000) by also pro- […] produce similar place cells if the agent samples the environment uniformly, the policy dependence of the BVC-SR model provides a mechanism for estimating how behavioral biases will influence place cell firing. These models both use BVCs as the basis for allocentric place representations in the brain. As a consequence, they would be unable to distinguish between visually identical compartments based on boundary information alone. To achieve this, the models would require some form of additional information about the agent's past trajectory, such as a path integration signal. Theoretical evidence (Bicanski & Burgess, 2018; Byrne, Becker, & Burgess, 2007) suggests that recently discovered egocentric BVCs (Gofman et al., 2019; Hinman, Chapman, & Hasselmo, 2019) could provide the link between the egocentric perception of the environment and an allocentric representation in the hippocampal formation.

[…] The BVC-SR eigenvector grid patterns are fragmented in a compartmentalized maze and repeat across alternating maze arms, as has been observed in rodents (Derdikman et al., 2009). (h) The Pearson's correlation matrix between the grid patterns on different arms of the maze has a checkerboard-like appearance due to the strong similarity between alternating internal channels of the maze (n = 160 eigenvectors). Again, similar results have been noted empirically (Derdikman et al., 2009).
The focus of this work has centered on the representation of successor features in the hippocampus during the absence of environmental reward. However, a key feature of SR models is their ability to adapt flexibly and efficiently to changes in the reward structure of the environment (Dayan, 1993;Russek, Momennejad, Botvinick, Gershman, & Daw, 2017;Stachenfeld et al., 2017). This is permitted by the independent updating of reward weights (Equation [7]) combined with its immediate effect on the computation of value (Equation [5]). Reward signals analogous to that used in the model have been shown to exist in the orbitofrontal cortex of rodents (Sul, Kim, Huh, Lee, & Jung, 2010), humans (Gottfried, O'Doherty, & Dolan, 2003;Kringelbach, 2005), and non-human primates (Tremblay & Schultz, 1999). Meanwhile, a candidate area for integrating orbitofrontal reward representations with hippocampal successor features to compute value could be anterior cingulate cortex (Kolling et al., 2016;Shenhav, Botvinick, & Cohen, 2013). Finally, the model relies on a prediction error signal for learning both the reward weights and successor features (Equations [6-7]). Although midbrain dopamine neurons have long been considered a source for such a reward prediction error (Schultz, Dayan, & Montague, 1997), mounting evidence suggests they may also provide the sensory prediction error signal necessary for computing successor features with temporal-difference learning (Chang, Gardner, Di Tillio, & Schoenbaum, 2017;Gardner, Schoenbaum, & Gershman, 2018).
Successor features have been used to accelerate learning in tasks where transfer of knowledge is useful, such as virtual and real world navigation tasks (Barreto et al., 2017;Zhang, Springenberg, Boedecker, & Burgard, 2017). Although the successor features used in this paper were built upon known neurobiological spatial neurons, BVCs, the framework itself could be applied to any basis of sensory neurons that are predictive of reward in a task. Thus, the framework could be adapted to use basis features that are receptive to the frequency of auditory cues (Aronov, Nevers, & Tank, 2017), or even the size and shape of birds (Constantinescu, O'Reilly, & Behrens, 2016).
In summary, the model describes the formation of place and grid fields in terms of the geometric properties and transition statistics of the environment, while providing an efficient platform for goal-directed spatial navigation. This has particular relevance for the neural underpinnings of spatial navigation, although the framework itself could be applied to other basis sets of sensory features.

DATA AVAILABILITY STATEMENT
Specific code can be made available upon reasonable request, and full code for simulations will be made available at https://github.com/willdecothi in due course.