Variational log‐Gaussian point‐process methods for grid cells

Abstract We present practical solutions for applying Gaussian-process (GP) methods to calculate spatial statistics for grid cells in large environments. GPs are a data-efficient approach to inferring neural tuning as a function of time, space, and other variables. We discuss how to design appropriate kernels for grid cells, and show that a variational Bayesian approach to log-Gaussian Poisson models can be calculated quickly. This class of models has closed-form expressions for the evidence lower bound, and can be estimated rapidly for certain parameterizations of the posterior covariance. We provide an implementation that operates in a low-rank spatial frequency subspace for further acceleration, and demonstrate these methods on experimental data.


| INTRODUCTION
Grid cells in the hippocampal formation modulate their firing rates as a periodic function of location (Hafting et al., 2005; Rowland et al., 2016). Some grid cells are also modulated by head direction (Sargolini et al., 2006; "conjunctive cells"), and recent studies have found more subtle dependence on head direction (Gerlei et al., 2020) and landmarks (Keinath et al., 2018; Krupic et al., 2018), even in nonconjunctive cells. Exploring these relationships requires efficient statistical estimators to compare changes in the spatial dependence of grid-cell activity across conditions.
Standard approaches to spatial statistics have limitations. Grid-cell firing-rate maps are often estimated using a Gaussian kernel-density smoother (e.g., Brandon et al., 2011; Hafting et al., 2005; Killian et al., 2012; Langston et al., 2010). Naïve smoothing approaches remain noisy when data are limited, do not provide a quantification of uncertainty, cannot adapt to inhomogeneous spatial sampling, and cannot take advantage of the periodic structure of grid-cell firing. Conversely, approaches based on spatial autocorrelations (e.g., Hafting et al., 2005, among many others) reduce noise by averaging over space, but cannot be applied to single grid fields. Gaussian-process (GP) estimators are a promising solution to these challenges. They offer a principled, Bayesian approach to estimating firing-rate maps.
They incorporate assumptions to improve statistical efficiency, and provide a posterior distribution that quantifies uncertainty. However, open challenges remain in applying existing algorithms to exploratory analysis of large grid-cell data sets. Bayesian priors suitable for grid cells have not been described in the literature, and existing implementations are either limited to specific kernels or are too computationally intensive for large data sets. We resolve both of these issues, and illustrate practical benefits of GP methods compared to non-Bayesian estimators.
We briefly review GP methods in neuroscience, then present (1) a tutorial on applying GPs to grid-cell data; (2) a technical review of approximate inference algorithms; (3) applications of these methods on example data.
Formally, a GP distribution is specified by its mean function μ(x) and two-point covariance function Σ(x, x′), which are analogous to the mean vector μ and covariance matrix Σ of the multivariate normal distribution (see Keeley & Pillow, 2018; MacKay, 1998; Rasmussen, 2003 for a thorough introduction). In computation, however, GPs are almost always represented in terms of a finite-dimensional approximation. We will use the finite-dimensional notation z ∼ N(μ, Σ), with the understanding that this represents a particular finite-dimensional projection of our GP model.
Previous works have described GP methods for place and grid cells (e.g., Rad & Paninski, 2010; Savin & Tkacik, 2016; Wu et al., 2017). However, we encountered practical challenges when applying these methods to grid cells in large arenas. Computational efficiency is paramount for exploratory analyses of large data sets.
While scalable solutions exist, the fastest methods require spatial covariance priors that can be described in terms of nearest-neighbor interactions (Cseke et al., 2016; Rad & Paninski, 2010) or a product of rank-1 separable kernels (Savin & Tkacik, 2016). This is not ideal for grid cells, which can display spatial correlations between response fields separated by several centimeters, and which cannot be decomposed into a product of 1D kernels. Recent works have developed ways to approximate GP covariances that support fast calculations while remaining expressive (Jensen et al., 2021).
We elaborate upon these ideas, with a particular focus on grid cells, and introduce some new numerical approaches.
Specifically, the new contributions of this manuscript are (1) Tools for designing GP priors that take advantage of the local spatial topography of grid cells; (2) Efficient and expressive variational Bayesian methods; (3) Numerical algorithms with good performance on consumer-grade hardware; (4) A Python reference implementation and example application to grid-cell data.

| RESULTS
We will first review log-Gaussian Poisson models of neural spiking in the context of inferring a grid-cell firing-rate map. These combine a Gaussian prior on (log) firing rate with a Poisson likelihood for spikes.
We then review numerical approaches for finding the Bayesian posterior, discuss suitable priors for grid cells, and finally demonstrate applications on example data.

| An example experiment
Throughout this text, we will demonstrate GP methods on data from Krupic et al. (2018), which have also been presented in Chaudhuri-Vayalambrone et al. (2023). Figure 1 illustrates a spatial-navigation experiment (Krupic et al., 2018) in which a rat foraged in a 2 m × 1 m environment (Figure 1a). Spike counts yₜ from a grid cell in entorhinal cortex, along with positions xₜ = (x₁,ₜ, x₂,ₜ)ᵀ, were recorded in 20 ms bins, yielding time series X = {x₁, …, x_T}ᵀ and y = {y₁, …, y_T}ᵀ with T samples. Throughout this manuscript, we will denote scalars as lowercase letters "x," column vectors as bold lowercase letters "x," and matrices as bold capital letters "X." The resulting spatial data consist of a map of the number of times the rat visited each location, and the number of spikes observed during each visit. These can be summed on a spatial grid to form occupancy and spike-count histograms, which can be combined to yield a firing-rate histogram (Figures 1b and 4a). In Figure 1, we binned data on an 88 × 128 grid.
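To make this setup concrete, the sketch below shows one way to tally occupancy and spike-count histograms on a spatial grid. The function name, grid shape, and arena extent are illustrative defaults, not part of the reference implementation.

```python
import numpy as np

def bin_spikes(px, py, spikes, shape=(88, 128), extent=(0.0, 2.0, 0.0, 1.0)):
    """Tally visits (n) and spikes (k) on a regular spatial grid.

    px, py : arrays of positions per 20 ms time bin (meters)
    spikes : spike count in each time bin
    shape  : (rows, cols) of the spatial grid
    extent : (x0, x1, y0, y1) bounds of the arena in meters
    """
    rows, cols = shape
    x0, x1, y0, y1 = extent
    # Map positions to integer bin indices (piecewise-constant binning).
    i = np.clip(((py - y0) / (y1 - y0) * rows).astype(int), 0, rows - 1)
    j = np.clip(((px - x0) / (x1 - x0) * cols).astype(int), 0, cols - 1)
    flat = i * cols + j
    n = np.bincount(flat, minlength=rows * cols).reshape(shape)
    k = np.bincount(flat, weights=spikes, minlength=rows * cols).reshape(shape)
    with np.errstate(divide="ignore", invalid="ignore"):
        rate_hist = np.where(n > 0, k / n, np.nan)  # spikes per visit
    return n, k, rate_hist
```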

| Estimating a smoothed log-rate map
Our approach will follow variational inference for GP generalized linear models as outlined in Challis and Barber (2013). We consider "latent" GPs, whose values are observed through a firing-rate nonlinearity and neuronal spiking activity. We model the log firing rate z(x) (Figure 4b) as a GP, and spiking events as conditionally Poisson (Figure 2a). This model is sometimes called a log-Gaussian Cox process, after David Cox (Cox, 1955). It captures both correlations and over-dispersion via the covariance structure of z(x).
We model spike counts within a small time bin Δt as

  y ∼ Poisson(λΔt),  λ = exp(z).  (1)

The choice of an exponential firing-rate nonlinearity λ = exp(z) is useful for obtaining closed-form solutions in variational inference. For simplicity, we will choose time coordinates such that Δt = 1 and omit it going forward. The log-likelihood of observing spike count y given rate λ is then:

  ln Pr(y | λ) = y ln(λ) − λ − ln(y!).  (2)

The overall likelihood of all spiking observations y depends on the log-firing-rate map z(x) and the animal's trajectory over time. We assume that the spiking observations are independent conditioned on the log-rates z, so that the likelihood of the overall data set factorizes over time bins:

  Pr(y | z, X) = ∏ₜ Pr(yₜ | λₜ = exp(z(xₜ))).  (3)

For numerical implementations, we model the function z(xₜ) as a vector z = {z₁, …, z_M}, where each z_m reflects the value of z(x_m) at one of M spatial locations. To make the notation easier to read in these derivations, we will interpret each z_m as a piecewise-constant model of the firing-rate map in a small region of the environment, with value z(x) = z_m for x within bin m (in practice we use linearly interpolated binning for improved resolution; see Section 5.8).
We aggregate time points that fall in the same spatial bin, since these share the same log-rate z_m (this is a form of pseudo-point method; Quinonero-Candela & Rasmussen, 2005). We refer to individual bins by a single index m ranging from 1 to M. We denote the tallies of visits to each bin as n = {n₁, …, n_M}ᵀ and the tallies of spikes in each bin as k = {k₁, …, k_M}ᵀ, so that the log-likelihood becomes:

  ln Pr(y | z) = Σ_m [ k_m z_m − n_m exp(z_m) ] − Σₜ ln(yₜ!).  (4)

Since ln(y!) does not influence the gradient of (4) with respect to z, we ignore it when optimizing z. Having combined data from repeated visits to the same location, the likelihood in Equation (4) can then be written in vector notation as:

  ln Pr(k | n, z) = kᵀz − nᵀ exp(z).  (5)

This is the log-likelihood of observations (n, k) given z.
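As a minimal illustration, the binned log-likelihood (5) is a one-liner once n and k are tallied; the constant ln(y!) term is dropped, as noted above.

```python
import numpy as np

def binned_poisson_loglik(z, n, k):
    """Log-likelihood ln Pr(k | n, z) = k^T z - n^T exp(z), Equation (5),
    up to the constant sum of ln(y!) terms, which does not depend on z."""
    return k @ z - n @ np.exp(z)
```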
This observation model has the same form as a point-process generalized linear model (PP-GLM; e.g., Paninski, 2004; Truccolo, 2016; Truccolo et al., 2005). However, adjusting z to maximize (5) alone will lead to over-fitting. Instead, one can obtain a smoothed map by taking a Bayesian approach.
We can encode constraints like smoothness or periodicity in our choice of the prior Pr(z). We use a multivariate Gaussian prior:

  Pr(z) = N(μ_z, Σ_z).  (6)

Summing the log-likelihood (5) and log-prior (6) yields an expression for the log-posterior of z (up to constant terms):

  ln Pr(z | n, k) = kᵀz − nᵀexp(z) − ½(z − μ_z)ᵀΣ_z⁻¹(z − μ_z) + const.  (7)

When the dimension of z is large, estimating (7) via sampling or evaluating it on a grid is infeasible. Instead, we approximate the posterior as a multivariate Gaussian distribution.

| Covariance kernels for grid cells
Throughout this manuscript, we assume that the prior covariance between two points Σ_z(x₁, x₂) depends only on the displacement between them. In this case, the prior covariance takes the form of a convolution kernel. Since we evaluate our rate map on a rectangular grid, and since the prior covariance is a convolution, Σ_z is a circulant matrix, and products like Σ_z⁻¹(z − μ_z) can be computed using the Fast Fourier Transform (FFT) in O(M log(M)) time. (Note: when implementing convolutions via the FFT, it is important to add spatial padding equal to or larger than the kernel's radius, to avoid erroneous correlations across the periodic boundary.)

How should one select Σ_z? In GP regression, the covariance kernel describes how correlated (or anticorrelated) two points x_i and x_j in the inferred rate map are expected to be, as a function of the displacement between them: K(Δx) = Σ_z(x_i, x_j), where Δx = x_i − x_j. For any collection of spatial locations, the Σ_z induced by the kernel needs to be a valid covariance matrix; Σ_z must be positive semidefinite: it should be symmetric, real-valued, and have no negative eigenvalues. For our inference procedure to be sensitive to grid cells' periodicity, our kernel needs a periodic structure. A hexagonal map with period P and orientation θ₀ can be defined as the sum of three cosine plane waves, rotated π/3 radians from each other:

  K_grid(Δx) = Σ_{j=0}^{2} cos( (2π/P) ⟨Δx, u(θ₀ + jπ/3)⟩ ),  u(θ) = (cos θ, sin θ)ᵀ.  (8)

The ideal grid (8) is a valid kernel function: it is symmetric, and its Fourier transform consists of all nonnegative real coefficients.
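A minimal sketch of the ideal hexagonal kernel (8), evaluated on a grid of displacements, might look as follows; the 1/3 normalization (so that K(0) = 1) is an assumption for illustration.

```python
import numpy as np

def grid_kernel(shape, P, theta0=0.0):
    """Ideal hexagonal kernel (8): sum of three cosine plane waves 60° apart.

    shape  : (rows, cols) of the padded spatial grid, in bins
    P      : grid period in bins
    theta0 : grid orientation in radians
    """
    rows, cols = shape
    dy = np.arange(rows)[:, None] - rows // 2   # displacement grid Δx
    dx = np.arange(cols)[None, :] - cols // 2
    K = np.zeros(shape)
    for j in range(3):
        th = theta0 + j * np.pi / 3
        K += np.cos(2 * np.pi / P * (dx * np.cos(th) + dy * np.sin(th)))
    return K / 3.0  # normalize so the kernel equals 1 at zero displacement
```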
We also use a radially symmetric kernel (Figure 3a-3,4) for analyzing grid-cell period in an orientation-agnostic manner. We can construct a radial kernel by considering a ring of spatial-frequency components ξ = ρe^{iω} that match the spatial frequency ρ = 1/P of the grid, or, equivalently, a radially averaged version of (8). In the spatial domain, this kernel is the zeroth-order Bessel function of the first kind,

  K_r(r) = J₀(2πr/P).  (9)

This kernel is more general: it does not require a fixed, global grid orientation, and can be applied to cells with fields separated by a characteristic distance but no global lattice (as seen in the entorhinal cortex of bats, Ginosar et al., 2021), although in 3D the analogue of the radial kernel (9) takes the form K_r(r) = sin(2πr/P)/(2πr/P).

The zeros of (9) provide rule-of-thumb cutoff radii for various degrees of spatial interaction: the first zero corresponds to single fields, the second to an inhibitory surround, and the third to nearest-neighbor interactions. In this work, we truncate kernels to nearest-neighbor interactions using the cutoff radius r_c = k₃P/(2π), where k₃ ≈ 8.65 is the third zero of J₀. We apply a circular window W(Δx) = ϑ(r_c − |Δx|) (ϑ is the Heaviside step function), remove high spatial frequencies from the kernel by applying a 2D Gaussian blur K_σ with radius σ = P/π, and finally truncate any resulting negative Fourier coefficients to zero.
This heuristic procedure provided good spatial locality while limiting the kernel to the spatial frequencies of interest; we do not exhaustively compare possible kernels here, but we provide other windowing methods and options to control kernel anisotropy in the reference implementation.
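The windowing heuristic above can be sketched as follows; the exact blur and window options differ in the reference implementation, so treat this as illustrative.

```python
import numpy as np
from scipy.special import j0
from scipy.ndimage import gaussian_filter

def radial_kernel(shape, P):
    """Radial kernel (9), truncated at the third zero of J0, blurred, and
    projected back to positive semidefinite by clipping negative Fourier
    coefficients (the heuristic procedure described in the text)."""
    rows, cols = shape
    dy = np.arange(rows)[:, None] - rows // 2
    dx = np.arange(cols)[None, :] - cols // 2
    r = np.hypot(dx, dy)
    K = j0(2 * np.pi * r / P)                    # Equation (9)
    K *= r <= (8.6537 / (2 * np.pi)) * P         # window at third zero of J0
    K = gaussian_filter(K, sigma=P / np.pi)      # remove high frequencies
    Kf = np.fft.fft2(np.fft.ifftshift(K)).real   # symmetric kernel: real FFT
    Kf = np.maximum(Kf, 0)                       # clip negative eigenvalues
    return np.fft.fftshift(np.fft.ifft2(Kf).real)
```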
We introduce scale (σ₀²) and constant-offset (c) parameters to control the kernel's marginal variance and the variance assigned to the mean-log-rate component, respectively. Using either a grid or radial kernel as a base kernel K₀, we define the parameterized kernel K_Θ as:

  K_Θ(Δx) = σ₀² [ (K₀ · W) * K_σ ](Δx) + c,  (10)

where * denotes convolution and · pointwise multiplication. We discuss hyperparameter selection in the section "Optimizing kernel hyperparameters."
Generally, one can construct suitable kernels by computing the autocorrelation of a prototype firing-rate map, averaging to achieve any desired symmetries, and applying the desired spatial or spectral windowing. If the kernel is defined as a convolution over a regular grid, these operations can be computed quickly using the FFT. Since any product, convolution, or nonnegative linear combination of positive-semidefinite kernels is also positive semidefinite, complicated kernels can be constructed out of simple primitives.

| Variational inference
One can optimize the log-posterior (7) in z to obtain a smoothed firing-rate map. This is known as the maximum a posteriori (MAP) estimator (Figure 2b; see Section 5.1). The MAP estimator allows us to specify prior assumptions (e.g., smoothness and periodicity) by selecting the appropriate prior covariance Σ_z.
However, it is important to assess our confidence in the resulting rate map, and to have a formal way of checking whether our prior is reasonable. Variational Bayesian methods provide both: a principled approximation to posterior uncertainty, in the form of a GP covariance function.
In variational Bayesian inference (Figure 2c), we approximate the true posterior with a simpler distribution Q_ϕ(z) defined by some parameters ϕ. We use a multivariate Gaussian approximation here, so ϕ = {μ, Σ}:

  Q_ϕ(z) = N(μ, Σ).  (11)

Variational inference selects ϕ by maximizing a quantity called the evidence lower bound (ELBO). This is equivalent to simultaneously minimizing the Kullback-Leibler divergence D_KL from the posterior approximation to the prior, while maximizing the expected log-likelihood (5):

  ELBO(ϕ) = ⟨ln Pr(k | n, z)⟩ − D_KL[Q_ϕ(z) ‖ Pr(z)],  (12)

where ⟨·⟩ denotes expectation with respect to Q_ϕ.
The first term in (12) reflects the information gained by revising our estimates of z relative to our prior beliefs. Since both Q_ϕ(z) and Pr(z) are multivariate Gaussian, this term has the closed form:

  D_KL[Q_ϕ(z) ‖ Pr(z)] = ½[ tr(Σ_z⁻¹Σ) + (μ − μ_z)ᵀΣ_z⁻¹(μ − μ_z) − M + ln|Σ_z| − ln|Σ| ].  (13)

The second term in (12) is the expectation of our Poisson observation model (5) with respect to Q_ϕ:

  ⟨ln Pr(k | n, z)⟩ = kᵀμ − nᵀ⟨λ⟩,  (14)

where we abbreviate exp(z) as λ. We can write the overall objective L to be maximized as:

  L = kᵀμ − nᵀ⟨λ⟩ − ½[ tr(Σ_z⁻¹Σ) + (μ − μ_z)ᵀΣ_z⁻¹(μ − μ_z) − M + ln|Σ_z| − ln|Σ| ].  (15)

The term tr(Σ_z⁻¹Σ) encourages the posterior covariance to be close to the prior, and the term −ln|Σ| encourages the posterior to have large entropy.
A convenient property of the log-Gaussian Poisson model is that the expected firing rate ⟨λ⟩ (Figure 4c) required to calculate (15) has a closed form. Since we have assumed a multivariate Gaussian distribution for z, and since λ = exp(z), the firing rate λ is log-normally distributed. The expectation ⟨λ⟩ is the mean of this log-normal distribution, and has the closed-form expression

  ⟨λ_m⟩ = exp(μ_m + ½Σ_mm).  (16)

To simplify notation, we define λ̄ ≔ n ∘ ⟨λ⟩ as the expected rate (with dependence on μ and Σ implicit), corrected for the number of visits in each bin. We discuss numerical approaches for calculating (16) briefly in the next section, and in more detail in Section 5.6.
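In code, (16) and the visit-weighted rate λ̄ are one line each:

```python
import numpy as np

def expected_rate(mu, sigma_diag, n):
    """Log-normal mean (16), <λ_m> = exp(μ_m + Σ_mm / 2), and the
    visit-weighted rate λ̄ = n ∘ <λ> used in the ELBO gradients."""
    lam = np.exp(mu + 0.5 * sigma_diag)
    return lam, n * lam
```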

| Optimizing the variational posterior
With these preliminaries out of the way, we now consider the derivatives of (15) in terms of μ and Σ. These can be computed using modern automatic-differentiation tools (e.g., Jax; Bradbury et al., 2018). However, substantial speedups are possible by considering the analytic forms of the derivatives and identifying simpler ways to calculate them.
The gradient and Hessian of (15) with respect to μ are

  ∇_μ L = k − λ̄ − Σ_z⁻¹(μ − μ_z)  and  ∇_μ∇_μᵀ L = −diag[λ̄] − Σ_z⁻¹,  (17)

respectively. The derivative of (15) in Σ is more involved (see Section 5.4, Equations 33-35):

  ∂_Σ L = ½( Σ⁻¹ − Σ_z⁻¹ − diag[λ̄] ).  (18)

Optimizing the full M × M posterior covariance is impractical. Typically, one chooses a simpler parameterization. Combinations of low-rank factorizations and Toeplitz or circulant matrices are common (Jensen et al., 2021). In our case, an exact low-dimensional parameterization of the variational posterior covariance is available (Challis & Barber, 2013; Seeger, 1999; Equation 10). Note that the stationary point of (18) occurs when Σ⁻¹ = Σ_z⁻¹ + diag[λ̄]. This means that all variational posterior covariance matrices can be parameterized by a diagonal update diag[λ̄] to the prior precision matrix Σ_z⁻¹. We parameterize this update by the vector q = {q₁, …, q_M}, and seek a self-consistent solution q = λ̄:

  Σ(q) = ( Σ_z⁻¹ + diag[q] )⁻¹.  (19)

This models the posterior precision as a sum of the prior precision, plus information provided by observations at each location.
We obtain the gradient of L in q from (18) and (19) using the chain rule (see Section 5.4, Equations 35-37):

  ∇_q L = ½ diag[ Σ diag[λ̄ − q] Σ ].  (20)

This gradient is zero when q = λ̄. If Σ is full rank, this zero is unique, and one may optimize q by ascending the much simpler gradient λ̄ − q, which has the same fixed point. We maximize the evidence lower bound (15) by alternately updating μ and q. Updates to μ are similar to finding the MAP estimator. We optimize the posterior covariance for fixed μ via an iterative procedure that amounts to setting q ← λ̄ repeatedly (see Section 5.7).
There is one remaining difficulty to address. Calculating the expected firing rate (16) requires computing diag[Σ]: the marginal variances of the log firing rate at each location. For the parameterization in Equation (19), one must compute

  diag[Σ] = diag[ ( Σ_z⁻¹ + diag[q] )⁻¹ ].  (21)

We calculate this using a low-rank approximation of the posterior covariance in Fourier space (see Section 5.5).
We summarize all steps of this iterative procedure in pseudo-code in Algorithm 1. The key takeaways regarding the numeric implementation are these: (1) the posterior mean can be optimized readily using Newton-Raphson iteration, in much the same way as one might estimate the posterior mode for a log-Gaussian Poisson generalized linear model; (2) the ideal parameterization of the variational posterior covariance takes the form of a diagonal update to the precision matrix, which reflects the amount of information available at each spatial location. This can be updated by a straightforward fixed-point iteration reminiscent of the Laplace approximation (see Section 5.3).

ALGORITHM 1  Iterative procedure for variational-Bayesian log-Gaussian Cox process regression; ∘ denotes element-wise vector and matrix products, with u ∘ A ≔ diag[u]A and A ∘ u ≔ A diag[u], and (·)^{∘·} denotes element-wise power.

| Optimizing kernel hyperparameters
The prior covariance kernel in Equation (10) depends on unknown hyperparameters Θ: period P, scale σ₀², and mean offset c (and, for grid kernels, orientation θ₀). The variational Bayesian framework provides a principled way to optimize these. To evaluate the quality of hyperparameters, one first optimizes the variational posterior using the kernel determined by Θ. At the optimized Q_ϕ(z), Equation (12) lower-bounds the likelihood of the data for the chosen hyperparameters. This allows one to compare the quality of different choices of Θ (e.g., Figure 3b). We optimized Θ using a hill-climbing grid search, starting from a heuristic guess (see Section 5.9).

| Sampling spatial statistics
Once obtained, the GP posterior can be used to sample the distribution of likely firing-rate maps. For example, one may wish to obtain the probability distribution of the peaks of individual grid fields (Figure 4d).

Given a Gaussian posterior N(μ, Σ), samples can be drawn as z = μ + Σ^{1/2}η, where η is a vector of M Gaussian random numbers with unit variance and zero mean. However, obtaining Σ^{1/2} is impractical for large M. Sampling in the low-rank (D < M) space z̃ ∼ N(μ̃, Σ̃) is efficient (see Section 5.5). Samples can be drawn as

  z = μ + R̃ Σ̃^{1/2} η̃,  η̃ ∼ N(0, I_D),  (22)

where R̃ maps samples from the low-rank subspace into the full (spatial) representation, and is described in (41) and (42). The factor Σ̃^{1/2} can be calculated via the Cholesky-based procedure in Equation (43). Figure 4d uses sampling to visualize uncertainty in grid-field locations. We generated a peak-density map by plotting the fraction of samples that contain a local maximum within a radius of P/2, where P is the grid cell's spatial period. We segmented the arena into Voronoi cells associated with each grid field (out to a maximum radius of 70% of P), and calculated 95% confidence ellipses by fitting a 2D Gaussian to each segmented grid field's peak distribution.
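A sketch of posterior sampling via (22), assuming the low-rank basis R̃ and a factor of Σ̃ (e.g., from Equation 43) are already available:

```python
import numpy as np

def sample_posterior(mu, Rt, S_half, n_samples=100, rng=None):
    """Draw rate-map samples z = μ + R̃ (Σ̃^{1/2} η̃), Equation (22).

    mu     : (M,) posterior mean in the full spatial representation
    Rt     : (M, D) semi-orthogonal map from the low-rank subspace to space
    S_half : (D, D) factor with Σ̃ = S_half @ S_half.T
    """
    rng = np.random.default_rng(rng)
    eta = rng.standard_normal((S_half.shape[1], n_samples))
    return mu[:, None] + Rt @ (S_half @ eta)
```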

| Peak-location confidence intervals
For well-identified grid fields, one can calculate confidence intervals from the posterior distribution using a locally quadratic approximation. Consider a local maximum in the posterior mean μ(x) at location x₀. How much does x₀ move if a perturbation ε(x), sampled from the posterior covariance, is added? This can be calculated via a Taylor expansion in Δx = x − x₀. The slope at x₀ is zero, since it is a local maximum, so a Taylor expansion out to second order has only 0th- and 2nd-order (curvature) terms. The curvature in x is defined by the Hessian matrix H_z ≔ ∇_x∇_xᵀ μ(x)|_{x₀}. Out to second order, our grid-field log firing rate is:

  μ(x₀ + Δx) ≈ μ(x₀) + ½ Δxᵀ H_z Δx.  (23)

Now, add a first-order approximation of the perturbation, ε(x₀ + Δx) ≈ ε(x₀) + Δxᵀ∇_x ε(x₀):

  μ(x₀ + Δx) + ε(x₀ + Δx) ≈ μ(x₀) + ε(x₀) + Δxᵀ∇_x ε(x₀) + ½ Δxᵀ H_z Δx.  (24)

Setting the derivative of (24) in Δx to zero and solving for Δx, we find that:

  Δx = −H_z⁻¹ ∇_x ε(x₀).  (25)

We can construct a covariance matrix Σ_Δx for the location of the peak using (25):

  Σ_Δx = ⟨Δx Δxᵀ⟩ = H_z⁻¹ ⟨∇_x ε(x₀) ∇_x ε(x₀)ᵀ⟩ H_z⁻¹.  (26)
We use the low-rank approximation Σ ≈ R̃ Σ̃ R̃ᵀ as in (22), where Σ̃ ∈ ℝ^{D×D} is the low-rank covariance and R̃ ∈ ℝ^{L²×D} is a semi-orthogonal operator defining our low-rank basis. We can use the Cholesky decomposition to obtain Q ∈ ℝ^{D×D} such that Σ̃ = QQᵀ, and calculate (26) as

  Σ_Δx = H_z⁻¹ G Gᵀ H_z⁻¹,  G = ∇_x[R̃Q](x₀),  (27)

since ε = R̃Qη̃ with η̃ ∼ N(0, I_D) implies ⟨∇_xε ∇_xεᵀ⟩ = GGᵀ. Figure 4d compares field-location confidence intervals obtained either by sampling or by quadratic approximation. These methods agree for well-localized peaks.
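Assuming the local Hessian H_z and the gradient factor G = ∇_x[R̃Q](x₀) have already been evaluated, (26)-(27) reduce to a couple of matrix products:

```python
import numpy as np

def peak_location_cov(H, G):
    """Confidence ellipse for a grid-field peak, Equations (25)-(27).

    H : (2, 2) Hessian of the posterior mean log-rate at the peak x0
    G : (2, D) spatial gradient of the low-rank factor R̃Q at x0,
        so that ∇ε(x0) = G η̃ with η̃ ~ N(0, I_D)
    Returns Σ_Δx = H^{-1} G G^T H^{-1}.
    """
    Hinv = np.linalg.inv(H)
    return Hinv @ (G @ G.T) @ Hinv
```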

| Head-direction dependence
In Figure 5, we show two ways to use LGCP regression to estimate head-direction dependence in grid cells. First, we partitioned the 30-min recording session into subsets, with sample weights (Figure 5a,b) given by the squared cosine similarity between the rat's head direction and a reference direction ϕ₀, set to zero for opposing directions (Equation 28).
This weighting separates data from opposing head directions into nonoverlapping subsets. Sweeping the estimator over a range of head-direction angles ϕ₀ ∈ [0, 2π) reveals a continuous and smooth dependence of grid-field peaks on head direction (Figure 5c). Opposing directions (cardinal directions shown in Figure 5d) show clear differences.
Second, we inferred position and head-direction tuning jointly by adding head direction as a third axis to the LGCP regression. To facilitate comparison with Figure 5a-d, we used the same weighting function (28), adjusted to make it a positive semidefinite kernel K_ϕ (see Section 5.10; Figure 5e). The full 2D + direction kernel was a tensor product K_ϕ ⊗ K_x with a position kernel K_x (an optimized version of the grid kernel; Figure 3a-2). The resulting posterior provides a joint distribution over 2D + direction tuning curves, visualized qualitatively in Figure 5f with head direction mapped to hue. As in Figure 4d, one can obtain the distribution of grid-field peaks, in this case conditioned on head direction (see Section 5.10). This posterior peak-density map (Figure 5g) recapitulates the head-direction dependence found by applying separate regressions to sub-sampled data (Figure 5c).

| Estimator performance
We quantify the advantages of LGCP regression over naïve kernel density estimators (KDEs) in Figure 6. We evaluated estimator performance on a simulated grid map and 30-min recording session (Figure 6a) similar to Krupic et al. (2018). On simulated data, the LGCP estimator (optimized grid kernel; Figure 3a-2) was more accurate than the KDE for a given amount of training data, exhibited less bias than a KDE with bandwidth matching the grid-field scale, and exhibited less variance than a finer-scale KDE (Figure 6b-d; see Section 5.11).
We tested the ability of LGCP regression to predict neuronal activity under cross-validation (Figure 6e,f). We stress, however, that the application of LGCP regression is not to predict neuronal activity exactly, but rather to infer larger-scale features of the grid map by discarding irrelevant fine-scale detail. Nevertheless, the calibrated LGCP estimator consistently matched or exceeded the predictive performance of a kernel density estimator with bandwidth matched to the grid scale (see Section 5.11).
We show two measures of performance in Figure 6e,f: the expected log-likelihood of held-out test data under the regressed LGCP posterior, relative to the log-likelihood of a KDE (Figure 6e), and the same results in terms of normalized explained deviance (Figure 6f; see Section 5.12).

| DISCUSSION
We have introduced a variational Bayesian approach to analyzing data from grid cells. We focused on challenges associated with working with grid cells in larger environments, and prioritized computational efficiency to facilitate exploratory analysis of large data sets. Our method incorporates prior assumptions of grid-cell periodicity, is computationally and statistically efficient, yields approximate confidence intervals, and provides a way to compare different prior assumptions (i.e., optimize the kernel hyperparameters).

FIGURE 5  LGCP analysis of joint position-head-direction tuning. We examined head-direction tuning in a cell from Krupic et al. (2018) by conditioning on subsets of the data (a-d), and via estimation of a joint log-rate posterior (e-g) (see Section 5.10). (a) The LGCP's efficiency makes it practical to compare changes in the rate map between subsets of the experimental data. We separated opposing head directions into nonoverlapping subsets weighted by cosine similarity between the rat's head direction and a reference direction. (b) The rat's smoothed head direction, denoted via line segments (colored by nearest cardinal direction) stemming from the smoothed position trajectory (black). (c) In this cell, the posterior rate-map peaks depend smoothly on head direction. (d) Comparing opposing head directions reveals directionality. (e) One can also estimate position and head-direction tuning jointly. Here, we clipped negative Fourier components of the squared-cosine weighting in (a) to form a positive-semidefinite head-direction kernel (shown; 24 direction bins). (f) The inferred 2D + heading rate map, depicted in a qualitative color scheme, with preferred head direction mapped to hue. (g) The peaks in the sampled 2D + heading posterior, conditioned on each head direction, recapitulate the directional shifts seen in Figure 5c.

| Caveats
The posterior covariance of a GP regression is only interpretable if the kernel is a true model for the data's correlation structure. To guard against mis-specified kernels, we recommend robust controls for formal hypothesis testing, such as shuffle tests that remove a purported effect from the underlying data. These are computationally intensive but feasible, requiring resources similar to shuffle controls for large GLM regressions.
We have assumed that a single kernel captures the correlation structure at all locations, whereas grid cells are known to display subtle changes, for example, near boundaries (Hägglund et al., 2019). While it is possible to relax the assumptions encoded in the kernel (e.g., by summing multiple angular/radial kernels to create orientation/period flexibility), a spatially varying solution may be preferable. Research into non-stationary (or spatially inhomogeneous) kernels is ongoing (e.g., Paun et al., 2023), but we are aware of no suitable advances for this specific problem. In principle, one could merge maps from different priors in different regions of the arena post hoc. We leave such explorations for the future.

| Generality and applicability
The GP estimators described here combine aspects of traditional kernel-density smoothers and Poisson generalized linear models with a principled (periodic) prior. The advantages of GP regression over the KDE are that GP regression (1) adapts to smooth more where data are limited, (2) can be used to infer parameters of the spatial correlation structure (e.g., identify grid period), and (3) can employ priors with a smoothing radius that exceeds the size of a single grid field.
Our implementation is broadly applicable to problems that are large enough for the cubic complexity of naïve GP estimators to become burdensome, but which are unsuitable for existing sparse or low-rank algorithms. It can be applied to intermediate-size spatiotemporal inference problems with kernels that are sparse in the frequency domain.
Although we focused on the Poisson observation model, our approach generalizes to other observation models in the natural exponential family. The main caveat is that other observation models may lack a convenient closed-form expression for the expected firing rate, and these terms may need to be approximated via sampling.

| Future work
The methods we have described are suitable as a drop-in replacement for the smoothers and kernel density estimators currently used to analyze grid-cell activity.
In addition to providing a principled smoother that is aware of a grid cell's spatial scale, our approach provides approximate confidence intervals.
These algorithms have considerable potential, and could be extended, for example, to incorporate tuning to additional behavioral covariates, or additional latent rate fluctuations that are correlated in time.
We plan to apply these methods in our own work. We hope that others will as well, in addition to building upon these approaches to design new algorithms for spatiotemporal statistics in the study of spatial navigation.

| Finding the posterior mode
One approach to estimating the firing-rate map is to find the z that maximizes the log-posterior (7). This is the MAP estimator, and it can be found via gradient ascent. The gradient of (7) in z is:

  ∇_z ln Pr(z | n, k) = k − λ̃ − Σ_z⁻¹(z − μ_z),  (29)

where we have defined λ̃ ≔ n ∘ λ as the vector of estimated firing rates weighted by the number of visits to each location n (∘ denotes element-wise multiplication).
The (negative) log-posterior (7) is convex, and well approximated as locally quadratic. The Newton-Raphson method is applicable, and much faster than gradient descent. This uses the curvature (Hessian, ∇_z∇_zᵀ) of (7) in z:

  H_z = −diag[λ̃] − Σ_z⁻¹.  (30)

On each iteration, Newton-Raphson must solve the update

  ẑ ← ẑ − H_ẑ⁻¹ J_ẑ,  (31)

where J_ẑ is the gradient (29) evaluated at the current estimate ẑ. The Hessian and Jacobian for the MAP estimator are the same as those for the variational posterior mean (17), but with the expected rate λ̄ replaced by the point estimate λ̃. To compute H_ẑ⁻¹J_ẑ quickly in high dimensions, we used an inexact Newton-Raphson method (Dembo et al., 1982) that approximates H_ẑ⁻¹J_ẑ on each iteration of (31) via a preconditioned Krylov method (see Section 5.2).
Algorithms combining Newton-Raphson iteration with Krylov methods were first developed for very large, sparse GP models (Cseke et al., 2016; Kiiveri & De Hoog, 2012), but they apply to any problem where Au can be computed quickly but A⁻¹u is impractical. In our case, our prior covariance Σ_z is a convolution, and we calculate Σ_z⁻¹u quickly as point-wise multiplication in the spatial-frequency domain (see Section 5.5). We can therefore calculate the Hessian-vector product H_ẑu quickly, which allows us to expediently calculate H_ẑ⁻¹J_ẑ using a Krylov-subspace solver.
In our tests, we found that SciPy's (Virtanen et al., 2020) implementation of the minimum-residual Krylov-subspace algorithm (MINRES; Paige & Saunders, 1975) provided the best balance of speed and stability. To make complex-valued spatial-frequency components compatible with Krylov solvers designed for real-valued matrices, we used the Hartley transform rather than the Fourier transform (see Section 5.5).
Krylov-subspace algorithms benefit from a preconditioner M that approximates the inverse (H_ẑ⁻¹ in our case). We used the prior covariance kernel for this, M = Σ_z, also computed as a convolution via pointwise multiplication in a low-rank spatial-frequency domain (see Algorithm 1).
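A sketch of one inexact Newton step using SciPy's MINRES is shown below; conv_fft and deconv_fft stand for user-supplied routines applying Σ_z and Σ_z⁻¹ as FFT convolutions (hypothetical helpers, not part of SciPy).

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

def newton_step(z, k, n, mu_z, conv_fft, deconv_fft):
    """One inexact Newton-Raphson update (31) for the MAP estimate (29)-(30).

    conv_fft(u)   : applies the prior covariance,  Σ_z u   (FFT convolution)
    deconv_fft(u) : applies the prior precision,   Σ_z⁻¹ u (FFT deconvolution)
    """
    lam_t = n * np.exp(z)                       # λ̃ = n ∘ λ
    J = k - lam_t - deconv_fft(z - mu_z)        # gradient (29)
    M = z.size
    # Negative Hessian −H = diag[λ̃] + Σ_z⁻¹, applied matrix-free.
    negH = LinearOperator((M, M), matvec=lambda u: lam_t * u + deconv_fft(u))
    # Precondition with the prior covariance Σ_z, also matrix-free.
    P = LinearOperator((M, M), matvec=conv_fft)
    dz, _ = minres(negH, J, M=P)                # solve (−H) dz = J
    return z + dz                               # ẑ ← ẑ − H⁻¹J
```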

| Connection to the Laplace approximation
The Laplace approximation (Figure 2b) models the posterior uncertainty in the MAP-estimated log-rate ẑ as a Gaussian centered at μ = ẑ, with covariance equal to the negative inverse of the Hessian (30) evaluated at ẑ:

  Σ_Laplace = (−H_ẑ)⁻¹ = ( Σ_z⁻¹ + diag[λ̃] )⁻¹.  (32)

Intuitively, (32) says that the Laplace approximation models the posterior precision as a sum of the prior precision and a diagonal matrix representing information from spiking observations. Note the similarity between the derivatives of the variational mean (17) and those of the MAP estimator (29) and (30), and the similarity between the Laplace-approximated posterior variance (32) and the covariance update for variational Bayes (19) and (45).
Optimizing the variational mean is tantamount to calculating the MAP estimator using the expected rate ⟨λ⟩ rather than a point estimate λ. Likewise, updating the posterior covariance is tantamount to applying the Laplace approximation at the variational mean, again using the expected rate rather than a point estimate.

| Derivatives
In this section, we derive the gradients of the evidence lower bound (15) with respect to Σ and q (Equations (18) and (20) in the main text, respectively).
First, we obtain the derivative of the evidence lower bound (15) with respect to the posterior covariance matrix Σ. Consider the derivative of the term nᵀ⟨λ⟩ with respect to individual elements Σ_ij. We use Einstein summation notation, wherein sums over repeated indices are implied. Since ⟨λ_m⟩ = exp(μ_m + ½Σ_mm),

  ∂_{Σ_ij} n_m⟨λ_m⟩ = n_m⟨λ_m⟩ · ½ δ_im δ_jm = ½ λ̄_i δ_ij,  (33)

where δ_ab is the Kronecker delta (1 if a = b and 0 otherwise).
The derivative of the term tr(Σ_z⁻¹Σ) in (15) is given by identity (100) in The Matrix Cookbook (Petersen & Pedersen, 2008), and is Σ_z⁻¹ (using the symmetry of Σ_z). The derivative of the term ln|Σ| is given by identity (57) in The Matrix Cookbook (assuming |Σ| ≠ 0), and is Σ⁻¹. Overall, then, the derivative of (15) in Σ is:

  ∂_Σ L = ½( Σ⁻¹ − Σ_z⁻¹ − diag[λ̄] ),  (34)

which, at the parameterization Σ⁻¹ = Σ_z⁻¹ + diag[q], simplifies to:

  ∂_Σ L = ½ diag[q − λ̄].  (35)

We can now obtain the derivative of the evidence lower bound (15) in q via the chain rule,

  ∂_{q_k} L = (∂_{Σ_ij} L)(∂_{q_k} Σ_ij).  (36)

To obtain ∂_{q_k}Σ_ij, we will need identity (59) in The Matrix Cookbook, which provides the chain rule for the derivative of the inverse of a matrix: ∂_{q_k}Σ = −Σ (∂_{q_k}Σ⁻¹) Σ = −Σ e_k e_kᵀ Σ, where e_k is the kth standard basis vector. Combining (35) and (36) gives:

  ∂_{q_k} L = ½ [ Σ diag[λ̄ − q] Σ ]_kk.  (37)

| Working in a low-rank subspace
Working in a low-rank subspace can make large problems tractable.
We first find a low-rank approximation of the prior covariance Σ_z, and then perform inference within this subspace.
The prior covariance Σ_z is defined by a convolution kernel. The components ξ of this kernel in spatial-frequency (Fourier) space are the eigenvalues of Σ_z:

  Σ_z = F† diag[ξ] F,  (38)

where F is the unitary Fourier transform and † denotes the conjugate (Hermitian) transpose.
In practice, many spatial-frequency components will be close to zero. These are frequencies where the prior assigns very little probability. We work in a low-rank space consisting only of those directions in Σ_z where the prior has assigned non-negligible variance. We retain the D ≤ L² components ξ̃ whose magnitude in the prior covariance kernel is at least 10% of the eigenvalue of the prior covariance with the largest magnitude, ξ_max:

  ξ̃ = { ξ_d : |ξ_d| ≥ 0.1 |ξ_max| }.  (39)

The low-rank approximation to the posterior covariance can then be calculated as

  Σ ≈ F̃† ( diag[ξ̃]⁻¹ + F̃ diag[q] F̃† )⁻¹ F̃,  (40)

where F̃ is the (unitary) Fourier transform retaining only the non-negligible components ξ̃. F̃ is not invertible, but since it is semi-orthogonal, F̃† is its pseudoinverse.
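Selecting the retained components (39) is straightforward given the kernel on the padded grid:

```python
import numpy as np

def low_rank_basis(kernel, thresh=0.1):
    """Find the retained frequency components (39) for a convolutional prior.

    kernel : 2D prior covariance kernel, centered at index (0, 0)
    Returns the Fourier eigenvalues ξ and a boolean mask of the D retained
    components with |ξ| ≥ thresh · |ξ_max|.
    """
    xi = np.fft.fft2(kernel).real        # eigenvalues of the circulant Σ_z
    keep = np.abs(xi) >= thresh * np.abs(xi).max()
    return xi, keep
```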
Note that the Kullback-Leibler divergence contribution to the evidence lower bound (13) contains a constant factor −½M that depends on the number of dimensions M = L² of our multivariate Gaussian prior. When working in a low-rank D < M subspace, this term should be replaced by −½D to ensure that the evidence lower bound can be compared between models with low-rank subspaces of different ranks.
Fourier coefficients can take on complex values. This creates compatibility and performance issues with standard numerical linear-algebra software. To address this, we use a real-valued relative of the Fourier transform called the Hartley transform (Hartley, 1942).
We denote the Hartley transform as R, and the transform with negligible frequencies discarded as R̃. The Hartley transform is calculated by summing the real and imaginary components of the Fourier transform:

  R z = Re[F z] + Im[F z].  (41)

If F is the unitary Fourier transform, then R is also unitary.
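Under the convention in (41) (summing real and imaginary parts of the unitary FFT), a 2D Hartley transform is one line, and it is its own inverse:

```python
import numpy as np

def hartley2d(u):
    """2D discrete Hartley transform: a real-valued relative of the FFT,
    obtained by summing real and imaginary parts of the unitary FFT."""
    F = np.fft.fft2(u, norm="ortho")
    return F.real + F.imag  # self-inverse: hartley2d(hartley2d(u)) == u
```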
Equations (38)-(43) work similarly with the Hartley transform, replacing F with R and Hermitian transposes with ordinary transposes. Since the (circulant) prior Σ_z is symmetric, its Fourier coefficients ξ are real-valued, and the Hartley-transform coefficients for the prior are identical to the Fourier coefficients. We can write (40) using the Hartley transform as

  Σ ≈ R̃ Σ̃ R̃ᵀ,  Σ̃ = ( diag[ξ̃]⁻¹ + R̃ᵀ diag[q] R̃ )⁻¹,  (42)

where R̃ ∈ ℝ^{L²×D} is the semi-orthogonal operator mapping the retained components back into the spatial domain. During inference, only N < M bins with nonzero observations (n_m > 0) contribute to the expected log-likelihood, and we can further truncate R̃ to an N × D matrix for efficiency.
The frequency-subspace representation simplifies some of the matrix calculations. Let L be the size of the environment, so that L² is the total number of spatial bins. An L × L array can be converted into frequency space using the 2D fast Fourier transform, which costs O(L² log(L)). If we retain only D components, the corresponding transform is an L² × D matrix and a dense matrix-vector product costs O(L²D). Ordinary matrix multiplication can outperform the FFT when D ∼ O(log(L)). Since grid cells display only a narrow range of spatial scales, D can be small, and the complexity of each optimization iteration is competitive with simpler estimators.
We perform most calculations in this low-rank space, and never explicitly construct the posterior covariance. The only calculation that cannot be performed entirely in the low-rank space is the calculation of the expected firing rates λ̄ at each location, which we address in Section 5.6; with the Cholesky-based procedure described there, this costs roughly O(ND²), where N is the number of spatial bins containing observations.
When operating in a low-rank subspace, it is important that the mean of the variational posterior also be expressed in this subspace. Leaving μ in the full-rank space creates a poorly conditioned problem, since several directions will be ignored when calculating gradients using low-rank approximations.

| Calculating the expected firing rate
For Gaussian z ∼ N(μ, Σ) and an exponential firing-rate nonlinearity, the expected rate has the log-normal mean ⟨λ_m⟩ = exp(μ_m + ½Σ_mm) (Equation 16; Rule & Sanguinetti, 2018). Evaluating this expression requires the diagonal of the posterior covariance matrix.
In the low-rank subspace (42), these diagonal elements Σ_ii can be calculated with the following procedure:

  Λ = diag[ξ̃]⁻¹ + ( R̃ᵀ diag[q]^{1/2} )( R̃ᵀ diag[q]^{1/2} )ᵀ,
  Λ = CCᵀ (Cholesky factorization),
  Σ̃^{1/2} = R̃ C⁻ᵀ,  diag[Σ] ≈ diag[ Σ̃^{1/2} (Σ̃^{1/2})ᵀ ].  (43)

In (43), we first project the (square root of the) precision update diag[q^{1/2}] into the low-rank subspace. We then obtain the inverse posterior covariance in the low-rank space, Λ. Rather than invert this directly, we compute its Cholesky factorization and apply a triangular inverse solver. We then expand the result from the low-rank subspace using the inverse transform. This provides a factor Σ̃^{1/2} of the low-rank approximation to the posterior covariance such that Σ ≈ Σ̃^{1/2}(Σ̃^{1/2})ᵀ, from which we extract the diagonal variances. This factor is also useful for sampling from the variational posterior (22).
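A sketch of (43), assuming R̃ is stored densely as an M × D matrix (the reference implementation instead applies it via the FFT):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def posterior_marginal_variance(xi_kept, Rt, q):
    """Diagonal of the posterior covariance via Equation (43).

    xi_kept : (D,) retained prior eigenvalues ξ̃
    Rt      : (M, D) low-rank basis (transposed truncated Hartley transform)
    q       : (M,) diagonal precision update
    Returns diag[Σ] and the factor Σ̃^{1/2} = R̃ C^{-T}.
    """
    B = Rt.T * np.sqrt(q)                         # (D, M): R̃ᵀ diag[q]^{1/2}
    Lam = np.diag(1.0 / xi_kept) + B @ B.T        # low-rank posterior precision Λ
    C = cholesky(Lam, lower=True)                 # Λ = C Cᵀ
    S_half = solve_triangular(C, Rt.T, lower=True).T   # R̃ C^{-T}, shape (M, D)
    return np.sum(S_half**2, axis=1), S_half      # row norms give diag[Σ]
```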

| Iteratively estimating q
From (20), we see that q must equal λ̄ to maximize the evidence lower bound (15) (for fixed μ). However, since λ̄ depends on q, this must be solved self-consistently. This can be done by ascending the simpler gradient Δq ∝ λ̄ − q. Taking discrete steps yields the following fixed-point iteration:

  q^{(t+1)} = λ̄(q^{(t)}) = n ∘ exp( μ + ½ diag[Σ(q^{(t)})] ).  (44)

In practice, we implement this by iterating the marginal posterior variances v = diag[Σ]. This amounts to a different parameterization of (44):

  v^{(t+1)} = diag[ Σ(q) ],  q = n ∘ exp(μ + ½v^{(t)}).  (45)

Challis and Barber (2013) note that the iteration in (44) may diverge. In practice, we have found that the reparameterized iteration in (45) always converges when starting from v = 0, provided one re-optimizes the posterior mean before each step according to (17), and provided the prior is sufficiently well conditioned. Note that Σ(v^{(t)}) remains bounded in the parameterization in (45), since 0 ≼ Σ(v^{(t)}) ≼ Σ_z. This implies that the iteration cannot diverge to infinity if the prior precision Σ_z⁻¹ is full rank (≼ is the Loewner order on positive semidefinite matrices). Since these iterations are simply gradient ascent with a step size of 1, instability, if encountered, can be remedied by choosing a smaller step size.
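The resulting inner loop is compact. This sketch assumes the posterior_marginal_variance routine sketched in Section 5.6, and omits the interleaved updates of μ mentioned above:

```python
import numpy as np

def update_covariance(mu, n, xi_kept, Rt, n_iter=20):
    """Fixed-point iteration (45) on the marginal variances v = diag[Σ],
    starting from v = 0. In practice, re-optimize μ before each step."""
    v = np.zeros_like(mu)
    S_half = None
    for _ in range(n_iter):
        q = n * np.exp(mu + 0.5 * v)           # q ← λ̄, Equations (16), (44)
        v, S_half = posterior_marginal_variance(xi_kept, Rt, q)
    return v, S_half
```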

| Binning data
For clarity, we presented the derivations in this manuscript in terms of piecewise-constant spatial basis functions. In practice, linearly interpolated binning provides better resolution for a given grid size, and this is what we used in the provided reference implementation.
For each visit and/or spike at location x, we distributed the point mass at x over a 2 × 2 neighborhood of adjacent bins via linear interpolation. This amounts to using square-pyramidal basis functions to provide a piecewise-linear model of the inferred firing-rate map (compare to fig. 1 in Cseke et al., 2016).
Since most calculations are performed directly on the spatial-frequency components of the grid map, choices for spatial binning only affect the numeric integration of the data likelihood (5) over the spatial domain. Locally constant binning amounts to using a Riemann sum to compute this integral, and linearly interpolated binning amounts to using the trapezoid rule. Linear interpolation improves resolution compared to a piecewise-constant model, but conceptually there are no substantive differences.
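A sketch of the 2 × 2 bilinear binning described above, for positions already expressed in bin coordinates:

```python
import numpy as np

def bilinear_bin(px, py, weights, shape):
    """Distribute each point's mass over its 2x2 bin neighborhood
    (square-pyramidal basis functions; Section 5.8).
    px, py are in bin units, e.g., 0 ≤ px < cols - 1."""
    rows, cols = shape
    i0, j0 = np.floor(py).astype(int), np.floor(px).astype(int)
    fi, fj = py - i0, px - j0
    H = np.zeros(shape)
    for di, dj, w in [(0, 0, (1 - fi) * (1 - fj)), (0, 1, (1 - fi) * fj),
                      (1, 0, fi * (1 - fj)), (1, 1, fi * fj)]:
        np.add.at(H, (np.clip(i0 + di, 0, rows - 1),
                      np.clip(j0 + dj, 0, cols - 1)), weights * w)
    return H
```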

5.9.1 | Grid period P and orientation
We estimated the grid period P using the radial autocorrelogram of the firing-rate histogram ȳ = k/n, calculated by averaging the 2D spatial autocorrelogram over all angles. The radial autocorrelogram R_ρ for a period-P periodic spatial signal is given by Equation (9):

  R_ρ(‖Δx‖) ∝ J₀(2π‖Δx‖/P).  (46)

The location Δ_p of the first nonzero peak of R_ρ(‖Δx‖) depends on P, and we can solve for P given Δ_p as

  P = 2πΔ_p / k_{1,2},  (47)

where k_{1,2} ≈ 7.02 is the second (nonzero) zero of the first-order Bessel function of the first kind.
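A sketch of the period estimate (46)-(47); peak detection on real autocorrelograms may need smoothing first, so treat the argrelmax call as illustrative.

```python
import numpy as np
from scipy.signal import fftconvolve, argrelmax

K12 = 7.0156  # second nonzero zero of J1: first nonzero peak of J0

def estimate_period(rate_hist, bin_size=1.0):
    """Estimate the grid period P from the radial autocorrelogram (46)-(47)."""
    r_map = np.nan_to_num(rate_hist - np.nanmean(rate_hist))
    acorr = fftconvolve(r_map, r_map[::-1, ::-1], mode="same")
    cy, cx = np.array(acorr.shape) // 2
    iy, ix = np.indices(acorr.shape)
    r = np.hypot(iy - cy, ix - cx).astype(int).ravel()
    # Average the 2D autocorrelogram over angles at each integer radius.
    radial = (np.bincount(r, weights=acorr.ravel())
              / np.maximum(np.bincount(r), 1))
    first_peak = argrelmax(radial)[0][0] * bin_size      # Δp
    return 2 * np.pi * first_peak / K12                  # Equation (47)
```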
When using the oriented kernel (Figure 3a-2), we estimated the grid orientation from the phase of a 6-fold periodic sinusoid fit to the spatial autocorrelogram on the ring at radius r = k_{1,2}P/(2π) (the radius of the first ring of peaks, per Equation 47).
5.9.2 | Heuristic μ and prior mean μ_z

We used a Gaussian kernel-density smoother to estimate foreground and background log-rate maps. We use the background log-rate map for the prior mean μ_z in variational inference, and used the foreground as an initial guess when optimizing the posterior mean.
5.9.3 | Kernel height σ₀² and constant offset c

We calculated an initial estimate of the log-rate as the difference between the log-foreground and log-background maps. The variance of this map was then used to initialize the kernel height σ₀². The kernel's constant offset c controls how confident we are in our prior assumptions about the average log firing rate across the environment. The average log-rate is the average ("DC") component of the prior mean μ_z. We set c = 10³ to leave the inference procedure free to adjust the mean log-rate.

| Grid search
For the analyses shown in this paper, we refined kernel hyperparameters in a two-step process. Starting from heuristically initialized parameters, we estimated (P, σ₀²) via grid search with an orientation-agnostic kernel (Figure 3a-4). We then recursively searched nearby values of Θ until we found a local maximum. To reduce computational cost, we re-used solutions for the parameters of the variational posterior from previous choices of Θ as initial guesses when optimizing new Θ.

| Head-direction analyses
For Figure 5, head direction was tracked via a head-mounted infrared LED (see Krupic et al., 2018 for details). We converted the recorded head direction ϕ_raw(t) into cosine and sine components for smoothing. We used data from the entire experimental session to optimize the period, variance, and direction of a local-neighborhood grid kernel (Figure 3a-2) via grid search. We used this spatial kernel K_x for all subsequent regressions.
Analyzing head direction via weighted subsets of the data reduces to the 2D inference problem. For each reference direction ϕ₀, we defined a weighting function wₜ ∈ [0, 1] (Equation 28). We calculated weighted visit counts n_{ϕ₀,m} = Σ_{t : xₜ ∈ bin m} wₜ and spike counts k_{ϕ₀,m} = Σ_{t : xₜ ∈ bin m} yₜwₜ (compare to Equation 4). Inference of a heading-conditioned rate map then amounts to inferring a 2D position rate map using these weighted counts.
To construct the joint 2D + direction LGCP regression (Figure 5e-g), one treats the time-varying head direction as a third spatial dimension. The only difference from the spatial case is that the head-direction axis is genuinely periodic, so it does not require padding to avoid circular wrap-around. We defined a grid of 24 head directions uniformly spaced around the circle, and binned the smoothed head direction using linear interpolation (see Section 5.8).
To facilitate comparison between approaches, we modified the weighting function (28) into a positive semidefinite kernel by clipping its negative eigenvalues to zero, that is, K_ϕ(ϕ, ϕ′) = F⁻¹[ max(0, F[w(ϕ, ϕ′)]) ].
We constructed the joint kernel K_ϕx = K_ϕ ⊗ K_x as a Kronecker product in the spatial domain, then discarded all but the D = 1000 largest Fourier components to generate a low-rank subspace. Inference of the posterior log-rate density is then identical to the 2D case. Unlike Savin and Tkacik (2016), we do not exploit the Kronecker structure of the (direction ⊗ position) prior during inference, but rather infer the joint posterior in a low-rank subspace.
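A sketch of the clipping construction for K_ϕ follows; the squared-cosine form of the weighting (28) is our reading of Figure 5e rather than a verbatim transcription.

```python
import numpy as np

def direction_kernel(n_dir=24):
    """Clip negative Fourier components of a squared-cosine weighting to
    obtain a positive-semidefinite head-direction kernel K_φ (Section 5.10).
    The squared-cosine weighting is an assumption matching Figure 5e."""
    dphi = 2 * np.pi * np.arange(n_dir) / n_dir
    w = np.maximum(0, np.cos(dphi)) ** 2          # weight vs. angular offset
    wf = np.maximum(0, np.fft.fft(w).real)        # clip negative eigenvalues
    return np.fft.ifft(wf).real                   # K_φ as a function of Δφ

# Joint prior: Kronecker (tensor) product of direction and position kernels,
# K_φx = K_φ ⊗ K_x. In frequency space its eigenvalues are the outer product
# of the two kernels' eigenvalues, of which we keep the largest D = 1000.
```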
We calculated the head-direction-dependent peak-density map in Figure 5g by drawing 2D + direction samples from the inferred posterior distribution, conditioning on each head direction separately, and identifying local maxima within a radius of P/2.5, where P is the grid period (with peak locations up-sampled via quadratic interpolation).

| Assessing estimator performance
In Figure 6, we assess LGCP estimator performance on both simulated and experimental data. We simulated an ideal grid cell on a 90 × 90 grid with log-rate as in (8), scaled to a mean rate of 1.2 Hz, and with a spatial period of 13 bins. For comparison with the experimental results throughout the text, if each bin were 2 × 2 cm², this would correspond to a 1.8 × 1.8 m enclosure and a cell with a period of 26 cm. We simulated 30 min of random exploration at 50 samples per second as Brownian motion (σ² = 0.02 bin²/s) clipped to the arena boundaries, filtered twice with a first-order exponential smoother (τ = 190 ms).
We compared the accuracy, bias, and variance of the LGCP and KDE in Figure 6b-d. We defined a "scale-matched" Gaussian KDE with variance σ₀² = P²/(2π²). This matches the curvature of the Gaussian kernel at Δx = 0 to that of the radial autocorrelation (9), yielding a Gaussian kernel that approximates the size and shape of a single grid field. We also defined a "finer-scale" KDE kernel with variance σ² = σ₀²/8, which was noisier, but provided a less biased estimate in expectation. To assess accuracy, bias, and variance as a function of data size (i.e., recording length), we partitioned the synthetic data into 15 blocks and sampled bootstrapped training data sets of varying duration, with replacement (200 samples). We kept the kernel parameters fixed (for all estimators), rather than re-estimating them on each sample (see Section 5.12 for an assessment that incorporates hyperparameter uncertainty).

| Cross-validated performance measures
We compared the ability of the LGCP and the scale-matched KDE to predict spiking activity on held-out test data in Figure 6e,f. We used a simulated data set and 15 randomly chosen cells from Krupic et al. (2018) (of those with at least five grid fields). We compared three different kernels for the LGCP estimator: (i) a Gaussian radial basis function (RBF), with variance σ₀² = P²/(2π²) identical to that of the KDE; (ii) a radial kernel (Figure 3a-4), which included no assumptions about grid orientation; and (iii) a local-neighborhood grid kernel (Figure 3a-2). The Gaussian kernel provides a fair comparison with the KDE, and the radial-versus-grid-kernel comparison emphasizes the importance of hyperparameter optimization.
We assessed performance under 10-fold cross-validation. We tested both heuristic (see Section 5.9) and grid-search-optimized kernel hyperparameters. Hyperparameter estimates were repeated for each block with held-out data excluded. The reference KDE bandwidth was fixed using the cell's "true" period, as identified by the optimal kernel parameters on the whole data set.
We assessed LGCP performance using the expected log-likelihood (14) of the held-out test data under the inferred posterior distribution (or, for the KDE, the point-estimate log-likelihood (5)). Since changes in mean rate between training and test data are uninteresting for inferring spatial variations in tuning, we adjusted the predicted mean rate to match the test data before evaluating the (expected) log-likelihood ("adjusted log-likelihood"). We also report performance in terms of the change in % explained deviance, in analogy to a normalized R² statistic from linear regression. We defined the "null" model as one that simply guesses the mean rate of the test data, λ̂_null = ⟨y_test⟩ (worst-case performance), and the "saturated" model as λ̂_saturated = y_test (the theoretical maximum of the Poisson likelihood). Normalized explained deviance is given as

  D_explained = (L_model − L_null) / (L_saturated − L_null),  (48)

where L are the (expected) log-likelihoods of the respective models.
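A sketch of (48) for binned test data, using the null and saturated models defined above:

```python
import numpy as np

def explained_deviance(ll_model, y_test, n_test):
    """Normalized explained deviance, Equation (48), for binned test data.
    y_test, n_test: spike and visit counts per bin; ll_model: the model's
    (expected) Poisson log-likelihood of the test data, as in Equation (5)."""
    lam_null = y_test.sum() / n_test.sum()            # mean rate per visit
    ll_null = np.sum(y_test * np.log(lam_null) - n_test * lam_null)
    pos = y_test > 0                                  # empty bins contribute 0
    ll_sat = (np.sum(y_test[pos] * np.log(y_test[pos] / n_test[pos]))
              - y_test.sum())                         # λ̂ = y/n per bin
    return (ll_model - ll_null) / (ll_sat - ll_null)
```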
We report the improvement in (48) relative to the KDE baseline (×100%) in Figure 6f.

FIGURE 1  An example experiment. (a) In this experiment, a rat foraged in a 2 m × 1 m open environment (left). The rat's position over time x (right, top), as well as spike counts y from a single neuron in entorhinal cortex (right, bottom), were recorded (data from Krupic et al., 2018). (b) A firing-rate histogram (right, k/n) can be estimated by dividing the total number of spikes tallied at each location k (left) by the number of visits to each location n (middle). (Color scales are not quantitative.)

FIGURE 3  Periodic priors to infer grid-cell maps. (a) Periodic kernels suitable for grid cells: each plot shows the kernel's 2D Fourier spectrum (left), spatial-domain representation (center), and an example rate map sampled from the kernel (right). (1, 2) Oriented kernels are selective for the grid cell's preferred spatial orientation. (3, 4) Radial kernels based on the Bessel function include no prior assumptions about grid orientation. (b) Kernel parameters, like grid scale, can be selected by choosing the kernel that gives the best evidence lower bound (ELBO) after fitting the posterior rate map. Shown here are the loss functions for the period and variance of an oriented grid kernel (Figure 3a-2) for the cell in Figure 1.

FIGURE 4  Inferring grid-cell firing-rate maps with LGCP regression. (a) Rate histograms from three example cells from Krupic et al. (2018). (b) Posterior log-rate map from LGCP inference using the optimized grid-cell kernel (Figure 3a-2). (Background variations in log firing rate not included.) (c) Expected firing rate calculated using (16) from the variational posterior. (d) 95% confidence intervals for field location calculated either using a locally quadratic approximation (purple) (26) or sampling (teal) within each grid field's Voronoi region (all points closer to a given field than to any other, and no further away than 70% of the grid period), overlaid on the probability density of grid-field peaks (shaded).

FIGURE 6  Compared to kernel density estimators, LGCP regression is more efficient and exhibits a superior bias-variance trade-off. (a) We sampled Poisson spiking activity from a synthetic grid cell (mean rate 1.2 Hz) throughout 30 min of simulated foraging in a square arena. (b) Comparison between rate maps recovered via log-Gaussian Cox process (LGCP) regression (black, left), a kernel density estimator (KDE) with a kernel width matching the scale of a single grid field (red, middle), and a KDE estimate using a narrower kernel (blue, right). (c) The LGCP estimate correlates well with the ground truth, exhibiting less bias than a scale-matched KDE estimator, and less variance than a narrow one. (d) Comparison of accuracy (left; Pearson's correlation between the estimated and recovered rate), variance (middle), and bias (right) between the LGCP and the two KDEs (λ₀ = true rate, λ̂ = estimate). (e, f) We quantified cross-validated (10-fold) estimator performance on 15 randomly selected cells with at least five grid fields from Krupic et al. (2018), and on the simulated data (black ×). We estimated kernel parameters and the posterior rate map, and measured the expected log-likelihood of held-out data. We compared performance to a KDE smoother matched to the grid-field scale (Figure 6b, middle scenario). (e) Cross-validated LGCP expected log-likelihood (adjusted; see Section 5.11), relative to the KDE log-likelihood baseline. We explored three kernels: a Gaussian kernel (the same one used by the KDE), a radial kernel (Figure 3a-4), and a grid kernel (Figure 3a-2). The grid kernel fared worse using heuristic hyperparameters (light bars), but had superior performance once optimized (dark bars). (f) The same analysis as (e), reported in terms of % explained deviance (see Section 5.12).