A Hierarchical Spatio-Temporal Analog Forecasting Model for Count Data

1. Analog forecasting has been successful at producing robust forecasts for a variety of ecological and physical processes. Analog forecasting is a mechanism-free nonlinear method that forecasts a system forward in time by examining how past states deemed similar to the current state moved forward. Previous work on analog forecasting has typically been presented in an empirical or heuristic context, as opposed to a formal statistical context. 2. The model presented here extends the model-based analog method of McDermott and Wikle (2016) by placing analog forecasting within a fully hierarchical statistical frame- work. In particular, a Bayesian hierarchical spatial-temporal Poisson analog forecasting model is formulated. 3. In comparison to a Poisson Bayesian hierarchical model with a latent dynamical spatio- temporal process, the hierarchical analog model consistently produced more accurate forecasts. By using a Bayesian approach, the hierarchical analog model is able to quantify rigorously the uncertainty associated with forecasts. 4. Forecasting waterfowl settling patterns in the northwestern United States and Canada is conducted by applying the hierarchical analog model to a breeding population survey dataset. Sea Surface Temperature (SST) in the Pacific ocean is used to help identify potential analogs for the waterfowl settling patterns.


| 791
McDERMOTT ET al. to the current state and then assumes that the current state of the system will evolve in a manner similar to how the identified past states evolved. For our purposes, an analog refers to a past state of some system that is similar to the current state of the system. Analog forecasting is appealing for dynamical processes governed by some underlying, but unspecified, deterministic law. Specifically, analog forecasting leverages the predictability in these types of systems by finding past trajectories similar to the current trajectory of the system.
Much of the current development of spatiotemporal analog methods utilizes the idea of embedding a dynamical system in time, similar to the simplex prediction method outlined in Sugihara and May (1990) for univariate time series. Indeed, the Sugihara and May (1990) approach was one of the first practical methods to introduce the idea of embedding a dynamical system in the context of nonlinear forecasting.
Their methods utilized the state-space dynamical system reconstruction theory of Takens (1981). In complicated dynamical systems, one rarely observes all of the state variables. State-space reconstruction allows one to reconstruct a dynamical system with only a subset of the state variables, by considering those state variables at multiple lags in the past. As dynamical systems evolve in time, they tend to revisit previous paths in the phase space, where these paths live on some low-dimensional manifold of the entire space (i.e., the attractor). Thus, through the use of state-space reconstruction, one can recover features of past dynamical paths along the attractor. Sugihara and May (1990) recognized the utility of state-space reconstruction within the context of nonlinear forecasting. In particular, they showed how embedding vectors, created by lagging past states of a system (historical data) in time (e.g., see Chapter 3 of Cressie & Wikle, 2011), could be utilized to find robust analogs for the current state of the system. In an ecological context, Takens' theorem (Takens, 1981) has also been utilized to analyze the relationships between multiple components of an ecological system in Nichols, Moniz, Nichols, Pecora, and Cooch (2005), and more recently in Sugihara et al. (2012). This remarkably simple forecasting method has proved successful in a multitude of time series applications (e.g., Perretti et al., 2013;Sugihara et al., 2012;Zhao & Giannakis, 2014).
Mechanism-free and analog methods traditionally have relied on nonparametric and/or heuristic approaches that did not include a formal probabilistic error structure (although, see Tippett & DelSole, 2013;Lguensat, Tandeo, Ailliot, Pulido, & Fablet, 2016). Modern nonparametric analog methods require choice of the embedding dimension of the analogs, the number of past analogs to consider, and weights for those analogs. All of these choices can significantly impact the analog forecast. For example, the question of how many past analogs to use can be thought of as a k-nearest neighbor problem, where the neighborhood consists of the analogs most similar to the current state of the system. Given the number of "neighboring" analogs, a kernel defined by a smoothing parameter is typically used to determine the weights (e.g., McDermott & Wikle, 2016;Zhao & Giannakis, 2014). However, previous analog forecasting implementations have employed either some heuristic method that does not explicitly account for uncertainty associated with the choice or a multidimensional cross-validation search (e.g., Arora, Little, & McSharry, 2013), to choose these values.
The Bayesian framework described in McDermott and Wikle (2016) allows for both the estimation and incorporation of model averaging over the various parameters in the analog model, thereby accounting for the uncertainty induced by their selection.
Once framed within the context of Bayesian modeling, analog forecasting can be placed within the rich class of models available in the space-time hierarchical Bayesian framework (e.g., Cressie & Wikle, 2011;Wikle, 2015), which allows for robust quantification of uncertainty. We present here a hierarchical analog forecasting model that extends the model developed in McDermott and Wikle (2016) to include a formal non-Gaussian data model -specifically, a Poisson model to accommodate count data. This is the first analog method that accounts explicitly for non-Gaussian data within a statistical framework. The model is applied to the problem of producing oneyear-ahead forecasts of waterfowl settling patterns given the state of the Pacific Ocean sea surface temperature (SST). Because spatiotemporal analog forecasting can quickly become prohibitive for highdimensional processes, we introduce an approach for spatiotemporal dimension reduction in count data known as nonnegative matrix factorization (Lee & Seung, 2001).

| Waterfowl and sea surface temperature data
Migratory waterfowl settling patterns, productivity, and survival have been shown to depend strongly on climate-related habitat conditions (e.g., Feldman, Anderson, Howerter, & Murray, 2015;Hansen & McKnight, 1964;Herter, 2012). Settling patterns of waterfowl are of substantial importance to wildlife ecologists. From a management perspective, knowledge of settling patterns aids in establishing appropriate harvest regulations across management units and in determining the efficacy of habitat treatments designed to improve habitat for waterfowl (i.e., Lavretsky, Miller, Bahn, & Peters, 2014). Further, given the migratory nature of waterfowl and ongoing theoretical interests in understanding factors affecting the distribution and habitat selection of migratory species, predicting settling patterns has broad relevance.
It is known that changes to habitat conditions can lead to more flexible settling patterns along a latitudinal gradient that can mitigate site philopatry, and possibly decrease productivity or recruitment (e.g., Becker, 2015;Johnson & Grier, 1988;Karanth, Nichols, Sauer, Hines, & Yackulic, 2014). Given the well-known relationships between Pacific Ocean (particularly the tropical ocean) SSTs and North American climate conditions (e.g., Philander, 1990) and the potential for these conditions to affect waterfowl settling patterns (Sorenson, Goldberg, Root, & Anderson, 1998), it is reasonable to use Pacific Ocean SST as a proxy for future habitat conditions. In addition, the impact of Pacific SSTs is typically nonlinear (Hoerling, Kumar, & Zhong, 1997), suggesting nonlinear evolution models are appropriate. Although others (e.g., Wu, Holan, & Wikle, 2013) have successfully forecast Mallard duck (Anas platyrhyncho) settling patterns using a drought severity index, we provide a one-year forecast given the Pacific SSTs through the previous May.

| Spatiotemporal variables
Let Y t (s i ) be a component of a dynamical system at time t with spatial locations {s i , i = 1, … , n y }. Suppose we have access to data from the system for time periods {t = 1, … , T}. The set of data at all n y locations for time period t is defined as, Y t ≡ (Y t (s 1 ), … , Y t (s n y )) � .
Here, we consider count-valued data for Y t . Further, we consider the use of some spatiotemporal forcing (predictor) variable, defined as, x t � = (x t � (r 1 ), … , x t � (r n x )) � , for spatial locations, {r 1 , … , r n x } and time t′, to help forecast the process of interest (i.e., Y t ). Note that the time indices t and t′ are separated by τ period(s) (i.e., τ = 0, 1, 2, …), with potentially different time scales. As discussed in more detail below, in our application, τ represents the number of periods the response variable is forecasted into the future. Thus, the goal here is to forecast the value of Y T+τ given values of Y t for t ≤ T and for x t ′ for t ′ ≤ T. This is performed by weighting the past values of Y t based on how well corresponding past sequences of x t ′ match the most recent sequence of x t ′ (i.e., the most recent sequence up to time T), as described below.
Many spatiotemporal dynamical processes can be challenging to model due to the high-dimensional nature of the spatial component. Both the BPS waterfowl settling pattern data and SST data described above can be considered high dimensional. To efficiently model such spatiotemporal processes, some form of dimension reduction is usually performed (e.g., see Chapter 7 of Cressie & Wikle, 2011). Common methods such as empirical orthogonal functions (EOFs) are not ideal for noncontinuous responses such as count data because it is difficult to impose constraints (e.g., such as nonnegativity). Although more general ordination methods such as principal coordinate analysis and multidimensional scaling can be useful for noncontinuous data (e.g., Ellison & Gotelli, 2004), these methods also do not guarantee, in general, that after dimension reduction and projection back into physical space, the resulting process has the same support as the original data.

| Response vector dimension reduction
Consider the case where we have n y spatial locations and the n y -dimensional response vector at time t, Y t . We seek a n β -dimensional expansion coefficient vector, t , associated with a set of n β basis func- seek a reduced dimension representation such that n β << n y . When considering a linear basis expansion, then, we seek Y t ≈ t , where ≡ [ψ 1 , … , ψ n β ] is a n y × n β matrix. Then, the ordinary least squares estimate of the expansion coefficients is ̃ t = ( � ) −1 � Y t , assuming ( ′ ) is invertible. In situations where is orthogonal, this simplifies to ̃ t = � Y t . As an example, derived from the scaled left singular vectors of a full data matrix, Y ≡ [Y 1 , … , Y T ], are the EOF basis functions, and are orthogonal. A reduced rank representation of the response vectors in phase space is given by Ỹ t = ̃ t . Typically, one then considers the expansion coefficients, ̃ t , as the time-varying variable of interest.
When Y t has a constrained support, as with the count data of interest here, there is no guarantee that this back transformation (Ỹ t = ̃ t ) will result in appropriate support for the elements of Ỹ t (e.g., nonnegative values). This issue can be important in some applications, such as the analog forecasting problem of interest here, as we specify the t 's in a hierarchical model and require nonnegative values upon transformation back to physical space.
We employ nonnegative matrix factorization (NMF) (e.g., Lee & Seung, 2001) to enforce nonnegativity in the dimension reduction in the count data matrix. Given the n y × T data matrix Y, NMF gives: where is a n y × n β basis function matrix and the n β × T matrix B ≡ [ 1 , … , T ] contains (random) projection coefficients. In reference to (1), the notation W ≥ 0 for some matrix W, implies that each element of W is nonnegative. NMF has been applied in a variety of disciplines because of its ability to provide efficient dimension reduction while creating nonnegative basis functions. A number of different algorithms to conduct NMF have been proposed in the literature (e.g., Berry, Browne, Langville, Plemmons, & Paul Pauca, 2007), all with the goal of solving the following minimization problem: where D (Y, , B) is a loss function and R( , B) is some regularization function. Unfortunately, these NMF algorithms do not produce a unique factorization. Instead, they converge to a local minimum, thus producing different factorizations for different starting values (e.g., Boutsidis & Gallopoulos, 2008). To alleviate this nonuniqueness problem in our methodology, we use the Nonnegative Double Singular Value Decomposition (NNSVD) approach of Boutsidis and Gallopoulos (2008) to obtain starting values. Note that NNSVD was designed to produce fast convergence for sparse data structures (i.e., when Y contains a large number of zeros, as is the case with our BPS settling pattern data). The application to follow uses the so-called off-set NMF algorithm of Badea (2008).

| Forcing vector dimension reduction
The purpose of the forcing variables, {x t � }, is to identify analogs to help predict the response variable. Further, the success of any analog forecasting model is largely determined by its ability to find robust analogs. If n x is large, we typically must reduce the dimension of the process using spatial basis functions, ≡ [ 1 , … , n α ], where k = (ϕ k (r 1 ), … , ϕ k (r n x )) � . As with the response vector, if we assume linear projections, we can get projection coefficients by t � = ( � ) −1 � x t �.
McDermott and Wikle (2016) show that these projection coefficients can be combined to form time lagged embedding matrices. That is, let q represent the number of periods lagged back in time, then for period t, we can define the following n α × q embedding matrix: These embedding matrices are critical to the success of the analog forecasting model outlined below. For example, suppose we wanted to investigate whether the response variable at period t−1 was a robust analog for the response at period t. One could construct an embedding matrix A t corresponding to period t and another matrix A t−1 for period t − 1. We could assess the quality of Y t−1 as an analog for the response at period t, by examining the "distance" between A t and A t−1 .
The selection of basis functions to obtain t ′ can be flexible here and different choices of could potentially produce different sets of analogs. For example, EOFs would be an obvious choice if linearity was assumed. However, there is scientific evidence of a nonlinear relationship between precipitation (which could potentially affect habitat conditions) and SST anomalies (e.g., Hoerling et al., 1997), so we investigated several nonlinear dimension reduction techniques for the waterfowl settling pattern application.

| Hierarchical analog forecasting model
We now discuss the specifics of the spatiotemporal hierarchical Bayesian analog (HBA) forecasting model for count data. All of the stages of the presented HBA model are summarized in Table 1 below.
As our responses {Y t : t = 1, … , T} are count valued, we model the data with a Poisson distribution conditional on a spatiotemporal intensity process as: where { t : t = 1, … , T} is the n y -dimensional intensity process at locations {s 1 , … , s n y }. Using the basis functions from the NMF approximation (1), let t = t . Recall, the NMF guarantees ≥ 0 and thus, for t to be nonnegative, the distribution for t should have nonnegative support. If we denote the model parameters by ̃ (see below), then for period t, the process model on t is given by the truncated normal distribution: where, for period t, we define B −t ≡ [ 1 , … , t−1 , t+1 , … , T ] as the matrix of possible analogs and t = (ω(A t , A 1 , ), … , ω(A t , A t−1 , ), ω(A t , A t+1 , ), … , ω(A t , A T , )) � , as the weight associated with each of the potential analogs. Thus, a weighted prediction of the new t is based on the linear combination of past t values, B −t t . Due to the form of (5), in particular the weighted averaged (i.e., B −t t ), we found that using a log-Gaussian formulation in (5) failed to preserve the correct scale of the analogs.
Further, as described in Cangelosi and Hooten (2009), for a normal density left-truncated at zero, the mean is biased and this bias increases for values close to zero (which is the case for many elements of t ) as the left tail of the distribution has been distorted from the truncation at zero. In equation (5), h( ⋅ ) is the bias correction function

Hierarchical Bayesian Analog Model
Data model: Process model: T A B L E 1 Hierarchical model summary presented in Cangelosi and Hooten (2009). The need for the constant ϵ arises since as B −t t → 0, we have h(B −t t , σ 2 η ) → −∞. Thus, ϵ is set to an arbitrarily small value for computational purposes. The weights ( t ) in (5) are critical to the success of the analog forecasting model presented here. For example, during the training of the model, these weights determine how much each potential analog in B −t is weighted in order to predict t . We describe our choice of weights in the next section. It is important to note that, although the weights are applied to the potential analogs in a linear fashion (i.e., B −t t ), the resulting prediction for t can be considered nonlinear as the weights are determined by a nonlinear function (i.e., the Gaussian kernel).
The choice of analog weights and the analog "neighborhood'' is closely connected and important considerations in analog forecasting.
Let  m (A t ) denote the neighborhood of the analog A t for period t, where the number of nearest neighbors considered is represented by m. Defining d( ⋅ ) as a distance metric and θ = {θ 1 ,θ 2 } as a set of kerneldependent parameters, we have the following kernel weight function: for ≠ t, where θ 1 is a kernel smoothing parameter and θ 2 is a parameter associated with the distance function (see the Appendix).
Let, ω(A t , A , θ) be the normalized version of ω (A t , A , θ), where the normalization is accomplished by dividing by the sum of ω(A t , A , θ) across all T − 1 potential analogs for period t. Any valid distance metric d( ⋅ ) can be applied here; for example, analog forecasting methods traditionally use Euclidean distance. However, analog forecasting relies on identifying analogs that not only resemble the current state of the system but also move forward in a similar manner. For this reason, analogs that share a similar trajectory in phase space as the current trajectory of the system will produce the most successful forecasts.
Procrustes distance (e.g., see Hastie, Tibshirani, & Friedman, 2013) is a multivariate distance metric that transforms a comparison object (i.e., A ) to a target object (i.e., A t ), before calculating the Frobenius matrix norm between the target object and the transformed comparison object. Therefore, by defining d(A t , A ; θ 2 ) as the Procrustes distance (see the Appendix for the full details, including the specification of θ 2 ), we are able to compare the shape, and thus, the trajectory, between two embedding matrices (see Figure 1 for a visual example). In the definition of A t , we let q represent the number of lagged time periods when forming A t . As different values of q will lead to different embedding matrices, and thus potentially different analogs, we estimate q and give it a discrete uniform prior such that, q ∼ DU(q min , q max ). We also assign a discrete uniform prior to the number of neighbors parameter, m ∼ DU(m min , m max ). Finally, θ 1 and σ 2 η are both assigned inverse-gamma priors, θ 1 ∼ IG(a 1 , b 1 ) and σ 2 η ∼ IG(a 2 , b 2 ). Sampling from the posterior distribution is accomplished with Markov chain Monte Carlo (MCMC) methods (e.g., Robert & Casella, 2004  , ω(A T+1 , A T , ( ) )) � , the projection coefficients for period T + 1 can be forecasted for the th iteration as, . In this example, A T+1 is the initial condition for which we seek matching past analogs. Then, from the definition of (3), the first element of A T+1 is T � +1 , which is lagged τ periods behind the forecast target time, T + 1, thus leading to a τ-period ahead forecast of Y T+1 (see Figure 1 for an illustrative example).

| Model setup
We evaluate the predictive ability of the model by considering fore- The HBA model was implemented for all forecasted years with the same tuning parameters and prior distributions. Note, as n β increases, the NMF basis function approximation in (1) generally becomes more accurate. Because there is a computational cost to using higher values of n β , we found that n β = 14 appropriately balanced computational efficiency with the accuracy of the approximation.
Regarding the SST basis functions, in addition to the more traditional empirical orthogonal function (EOFs; i.e., spatiotemporal principal components) linear dimension reduction, we implemented the following nonlinear dimension reduction methods: local linear embeddings (e.g., Roweis & Saul, 2000), diffusion maps (e.g., Coifman & Lafon, 2006), kernel principal component analysis (KPCA) (e.g., Scholkopf, Smola, & Muller, 1998), and Laplacian eigenmaps (e.g., Belkin & Niyogi, 2001). Our analysis found Laplacian eigenmaps to be the most helpful of these nonlinear methods for identifying robust analogs. Therefore separate models, one with EOF basis functions (HBA1) and a second model with Laplacian eigenmap basis functions (HBA2), were implemented. Approximately 82% of the variation in the SST data was accounted for by the first 16 EOFs (i.e., n α = 16). Laplacian eigenmaps are calculated through an eigenvector decomposition of a Laplacian matrix, whose construction involves an adjacency matrix formed through either a kernel or a nearest neighbor approach (e.g., Belkin & Niyogi, 2001). We implemented the nearest neighbor approach, with n α = 16 again, by sampling the number of neighbors as a parameter in the MCMC sampler over the following grid: Although we are applying Takens' embedding theorem (Takens, 1981) with a relatively short temporal span (i.e., approximately 40 years) using 16 state variables to represent the SST data, we seem to be able to counterbalance the effects of a shorter temporal span. A further examination of this trade-off between the temporal span and number of state variables is beyond the scope of this study, but should be investigated elsewhere. This span is short enough that potential nonstationarities in the SST data are not a major concern.

| RESULTS
Prediction skill of the HBA and PST models was evaluated through calculation of the MSPE, defined as the mean of the squared differences between the posterior predicted means and the observed counts averaged across all spatial locations. The correlation between the observed counts and the mean of the posterior predictions was also used to evaluate the forecasting models, as is often considered for spatiotemporal prediction (e.g., Wilks, 2006). As displayed in Table 2

| DISCUSSION
Overall, many of the aspects of analog forecasting that originally made it appealing to ecologists are retained by the HBA model. The model has few parameters and performs well with data from a relatively short temporal span. Unlike other analog forecasting methods, the HBA allows users to properly quantify uncertainty in a rigorous framework. With the growing number of high-dimensional spatial-temporal ecological datasets, analog forecasting in a hierarchical framework can provide ecologists with a rich framework for making accurate forecasts with principled uncertainty quantification. The count-based spatiotemporal hierarchical Bayesian analog model methodology developed here was successful in that it produced forecasts that had high correlations with observed counts, along with outperforming a hierarchical Bayesian Poisson space-time model (in terms of MSE and correlation).
The result that waterfowl settled more consistently in the northern half of the region of interest in 1999 despite the lack of correlation with patterns from the previous year was promising. We suspect that poor habitat conditions due to drought (e.g., Wu et al., 2013), possibly T A B L E 2 Results based on the posterior predictive distribution for the two HBA models, and the PST model. Models are compared via mean squared prediction error (MSPE) and correlation (Corr) of the forecasted values with observed values. linked to the tropical Pacific La Niña anomaly (e.g, Philander, 1990;Hoerling et al., 1997), could help explain why many waterfowl overflew the southern region in 1999 (e.g, Hansen & McKnight, 1964;Sorenson et al., 1998). The distribution of migratory birds is notoriously affected by climatic factors. For example, the timing of waterfowl migrations might be affected by climate as can the use of stopover sites, the distance travelled, and the ultimate location for settling (Schummer, Cohen, Kaminski, Brown, & Wax, 2014 Our work provides proof of concept for application of hierarchical spatiotemporal forecasting models to aid in prioritizing habitat management decisions based on waterfowl settling patterns. Due to the preponderance of zeros present in the waterfowl data, the assumption of equidispersion implicit in the data model (i.e., equation (4)) is likely violated. Wu et al. (2013)  Such an approach may be useful for forecasting seasonal or yearly settling patterns that are influenced both linearly and nonlinearly by some high-dimensional variable.
Count data in ecology are ubiquitous and the model we developed is an ideal alternative to currently available quantitative methods.
Ecologists routinely collect count data through visual surveys, such as the waterfowl dataset used herein, or through use of other remote technologies. For example, rapid advancement of radio-tracking technology (e.g, Kays, Crofoot, Jetz, & Wikelski, 2015) and remote-sensed cameras (e.g, He et al., 2016) has transformed the way ecologists collect count data. These widely used technologies have also changed the type of data obtained both in terms of amount and structure of resulting data. In particular, these technologies result in large data structures with spatial and temporal dependencies and our model provides an appropriate way to address these complexities while quantifying uncertainty in a rigorous manner. Often, these count data are used by ecologists to assess settling patterns, habitat relationships, or impacts of weather conditions and predict future states. For example, migration routes of terrestrial mammals are imperiled (e.g, Berger & Cain, 2014) and there is much effort to identify and predict use of program.

CONFLICT OF INTEREST
None.

DATA ACCESSIBILITY
The raw indicated pair counts (for the waterfowl settling pattern