Spatiotemporally resolved multivariate pattern analysis for M/EEG

Abstract An emerging goal in neuroscience is tracking what information is represented in brain activity over time as a participant completes some task. While electroencephalography (EEG) and magnetoencephalography (MEG) offer millisecond temporal resolution of how activity patterns emerge and evolve, standard decoding methods present significant barriers to interpretability as they obscure the underlying spatial and temporal activity patterns. We instead propose the use of a generative encoding model framework that simultaneously infers the multivariate spatial patterns of activity and the variable timing at which these patterns emerge on individual trials. An encoding model inversion maps from these parameters to the equivalent decoding model, allowing predictions to be made about unseen test data in the same way as in standard decoding methodology. These SpatioTemporally Resolved MVPA (STRM) models can be flexibly applied to a wide variety of experimental paradigms, including classification and regression tasks. We show that these models provide insightful maps of the activity driving predictive accuracy metrics; demonstrate behaviourally meaningful variation in the timing of pattern emergence on individual trials; and achieve predictive accuracies that match or surpass those achieved by more widely used methods. This provides a new avenue for investigating the brain's representational dynamics and could ultimately support more flexible experimental designs in the future.


| INTRODUCTION
The use of decoding models for the analysis of magnetoencephalography (MEG) and electroencephalography (EEG) data has substantially grown in recent years (Grootswagers, Wardle, & Carlson, 2017).
Increasingly, researchers are turning to these methods, collectively referred to as multivariate pattern analysis (MVPA), for their increased sensitivity to distributed patterns of variation that can be attributed to a stimulus (Haynes & Rees, 2006). While projecting high-dimensional neural data down to a single metric of classification accuracy affords researchers greater statistical sensitivity, it generally comes at a cost to interpretability: the precise nature of the link between significant changes in decoding accuracy and the underlying neuroscience driving those changes can often be indirect or opaque (Haufe et al., 2014; Kriegeskorte & Douglas, 2019; Naselaris & Kay, 2015; Valentin, Harkotte, & Popov, 2020). Furthermore, commonly used methods that align all trials in time and proceed in a timepoint-by-timepoint fashion offer no sensitivity to patterns that are not perfectly synchronised in time across different trials (Borst & Anderson, 2015; Vidaurre, Myers, Stokes, Nobre, & Woolrich, 2019).
The main focus in the use of MVPA for M/EEG is to leverage these modalities' high temporal resolution to investigate the brain's representational dynamics over time (King & Dehaene, 2014). A convention has emerged for analysing trial data whereby successive spatial filters (each a vector with one linear coefficient applied to each sensor) are trained on data at different timepoints following the presentation of some stimulus (Carlson, Hogendoorn, Kanai, Mesik, & Turret, 2011; Carlson, Tovar, Alink, & Kriegeskorte, 2013; Cichy, Pantazis, & Oliva, 2014; Grootswagers et al., 2017; Haxby, Connolly, & Guntupalli, 2014; van de Nieuwenhuijzen et al., 2013). While this convention has certainly been useful, it is not without its flaws, as illustrated in Figure 1. Firstly, spatial filters are blind to any temporal structure in the data: they cannot detect patterns of temporal autocorrelation that are subtle but consistent over multiple timepoints. Secondly, spatial filters cannot be interpreted as reflecting the brain activity associated with stimuli; this has been widely characterised in the neuroscience literature (Haufe et al., 2014; Kriegeskorte & Douglas, 2019; Valentin et al., 2020), but remains somewhat counterintuitive. As shown in Figure 1, given an evoked response from two different stimuli in the presence of correlated noise, the coefficients of a linear spatial filter need not visually resemble or reveal these underlying patterns. Post hoc methods have been proposed to map backwards from a spatial filter to recover a forward model of the data (Haufe et al., 2014); however, these come with a number of caveats and are not guaranteed to provide the most accurate forward model parameter estimates. We instead propose to turn this approach on its head and ask: why not first learn a generative model of the data, and then use that model to make predictions of unseen stimuli?
In neuroscience terminology, this amounts to fitting an encoding model, then inverting that encoding model to make predictions through an equivalent decoding model (Friston et al., 2008; Haxby et al., 2014; Naselaris, Kay, Nishimoto, & Gallant, 2011). This approach has been successful in fMRI (Casey, Thompson, Kang, Raizada, & Wheatley, 2011; Friston et al., 2008; Kay, Naselaris, Prenger, & Gallant, 2008; Mitchell et al., 2008; Naselaris, Olman, Stansbury, Ugurbil, & Gallant, 2015; Naselaris, Prenger, Kay, Oliver, & Gallant, 2009; Nishimoto et al., 2011; Schoenmakers, Barth, Heskes, & van Gerven, 2013) but has seen only quite limited adoption for M/EEG (di Liberto, O'Sullivan, & Lalor, 2015; Kupers, Benson, & Winawer, 2020). In a similar vein, we therefore propose a linear generative model of stimulus-evoked activity based upon the popular General Linear Model (GLM) framework for M/EEG data (Trujillo-Barreto et al., 2008).

F I G U R E 1 Conventional approaches to decoding in M/EEG are mass univariate through time and difficult to interpret in space without further post hoc analysis. In a typical M/EEG experiment decoding different types of stimuli (in this example, images of faces and houses adapted from Negrini, Brkic, Pizzamiglio, Premoli, and Rivolta, 2017; the exemplary maps are based on real MEG data from the dataset presented below), the conventional approach extracts all data at one timestep from the stimulus onset time and trains a spatial filter on that data to distinguish the conditions. This process is repeated independently at all timesteps. This approach ignores the time series nature of the data, and the spatial filters are neither able to detect patterns that are consistent over multiple timesteps, nor patterns that are consistent over trials but not perfectly aligned in time. Furthermore, the spatial filter coefficients are not directly interpretable (Haufe et al., 2014).
While the GLM in neuroimaging is traditionally associated with mass univariate analysis, it can be extended to be multivariate across space, allowing multivariate predictions to be made on unseen test data through a simple application of Bayes' rule (Friston et al., 2008; Haxby et al., 2014). Notably, the inversion of an encoding model in M/EEG can be framed as a solution to the inverse problem, where it has been successfully applied to obtain estimates of source-localised activity (Trujillo-Barreto et al., 2008); here, we apply the same general approach to instead make predictions about unseen test data in a decoding-style manner. The widespread utility of the GLM approach in neuroimaging suggests it may have broad applicability across a range of different experimental paradigms; in support of this claim, we demonstrate its use across two very different datasets.
Further, we show how this encoding framework can be extended to take advantage of approaches that adapt to timing differences across different trials (Anderson, Fincham, Schneider, & Yang, 2012; Borst & Anderson, 2015; Obermaier, Guger, Neuper, & Pfurtscheller, 2001; Vidaurre et al., 2019; Williams, Daly, & Nasuto, 2018), thereby more fully utilising the high temporal resolution that is the main benefit of M/EEG as a recording paradigm. Such temporal variability over trials can elucidate key aspects of cognitive processing (Anderson et al., 2012; Anderson, Betts, Ferris, & Fincham, 2010; Borst & Anderson, 2015; Borst, Ghuman, & Anderson, 2016); however, it has generally only been investigated using methods tailored to specific task paradigms (Anderson et al., 2012; Anderson & Fincham, 2014a, 2014b). In contrast, we show that the same hidden Markov modelling (HMM) approach can be formulated using the popular GLM framework, potentially allowing such questions to be pursued across a broader range of neuroimaging experiments (Vidaurre et al., 2019). With this approach, we are able to characterise the emergence of distinct representational states on individual trials and better model patterns that endure over multiple timepoints but may not be perfectly aligned in time. Previous requirements for patterns to be perfectly aligned over multiple trials limited experimental designs to paradigms that ensure maximal inter-trial reproducibility (Light et al., 2010), or to paradigms designed to elicit large changes in activity that could be more easily identified (Anderson & Fincham, 2014a; Borst & Anderson, 2015). Our more flexible modelling approach overcomes this limitation, potentially allowing more ambitious experimental designs in which subtle patterns of inter-trial variability are anticipated and can be quantified.
We believe this opens a new door to investigating representational dynamics, by not only characterising when certain aspects of a stimulus emerge in data, but also asking how these representational dynamics are modulated across different trials.

| METHODS
Our approach extends the traditional general linear model (GLM) framework by (a) incorporating a latent Markov variable to explain time-varying dynamics within trials, and (b) modelling the multivariate spatial distribution simultaneously over all recorded channels. In so doing, we maintain the benefits of a generative model from the GLM approach, namely interpretable maps of linear activation strengths over each channel; alongside the benefits of multivariate methods, namely an increased sensitivity to distributed patterns of variation over multiple timesteps and amenability to hierarchical modelling (i.e., latent variable modelling of differences in dynamics over trials). These models are fit independently to each subject, resulting in distinct model parameters for each subject upon which standard group statistical tests can be run.

| Standard general linear model setup
We begin with the standard formulation of a GLM for evoked response analysis. For a specific subject, with a total of N samples of M/EEG data recorded across P sensors in a paradigm with Q regressors, we denote by Y the neural data of dimension [N × P], with an associated design matrix of regressors X of dimension [N × Q]. This standard GLM setup is not explicit about how the N samples relate to time or trials. The convention in M/EEG analysis is to fit independent GLMs over successive timepoints within a trial in the same manner as in Figure 1, such that N would equal the number of trials and the effects of each stimulus are modelled independently at each timepoint within the trial. Note that our use of "neural data" here is generic; in this article we use raw sensor data, but this could equivalently be commonly used features such as bandlimited power. The GLM is formulated as follows:

Y = XW + E, where each row of the [N × P] residual matrix E is drawn from N(0, Σ)

The matrix W, of dimension [Q × P], is directly interpretable, as each row is an estimate of the corresponding stimulus' activation pattern (i.e., its effects on each sensor's measured signal), while the confidence in that estimate depends on the entries of the [P × P] residual covariance matrix Σ.
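As a concrete illustration, this GLM fit can be sketched in a few lines of numpy. This is a minimal example of our own (the function name and synthetic data are illustrative, not part of any toolbox), estimating W by ordinary least squares and Σ from the residuals:

```python
import numpy as np

def fit_glm(X, Y):
    """Ordinary least squares estimate of the [Q x P] activation
    patterns W and the [P x P] residual covariance Sigma for the
    GLM Y = X W + E."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)            # [Q x P]
    resid = Y - X @ W                                    # [N x P]
    Sigma = resid.T @ resid / (Y.shape[0] - X.shape[1])  # residual covariance
    return W, Sigma

# Synthetic example: N = 200 samples, Q = 3 regressors, P = 5 sensors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
W_true = rng.normal(size=(3, 5))
Y = X @ W_true + 0.1 * rng.normal(size=(200, 5))
W_hat, Sigma_hat = fit_glm(X, Y)
```

With low noise and many samples, the recovered W_hat closely matches the generating patterns, illustrating the interpretability of the rows of W.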
The common use of the GLM in mass univariate analysis may lead some readers to mistakenly believe it is a univariate model. In fact, putting aside any spatial priors that could be placed on W, this model stores its multivariate information in the residual covariance matrix Σ; this matrix is simply not used in mass univariate hypothesis tests that, for example, perform group-level inference on W.
However, as has been established (Friston et al., 2008), this multivariate information can be exploited when the model is inverted to make predictions. We propose to extend this model hierarchically using the HMM framework. This assumes some level of correlation of activation patterns over multiple timepoints that is modelled by a latent state variable, at the same time as having the ability to capture differences in dynamics over trials. Specifically, we model the data vector observed at any point in time t on trial n as conditional upon a latent, discrete and mutually exclusive state Z_{n,t} ∈ {1, 2, …, K}, as follows:

Y_{n,t} | (Z_{n,t} = k) ~ N(X_{n,t} W_k, Σ_k)   (2)

where K is the number of states, and the kth state is associated with a distinct set of residual covariance patterns Σ_k and stimulus activation patterns W_k that linearly map between the trial stimulus X and the data Y. The latent states Z_{n,t} can then be inferred (with suitable constraints, as outlined below) to explain exactly when each distinct pattern emerges on each different trial, allowing us to infer the combination of activity patterns and corresponding latent state timecourses (by which term we refer specifically to the posterior state probabilities) that best explain the data:

P(Z_{1:N,1:T}, W_{1:K}, Σ_{1:K}, Λ_{1:K}, Φ | Y_{1:N,1:T}, X_{1:N,1:T}) ∝ P(Y_{1:N,1:T} | X_{1:N,1:T}, Z_{1:N,1:T}, W_{1:K}, Σ_{1:K}) P(Z_{1:N,1:T}, W_{1:K}, Σ_{1:K}, Λ_{1:K}, Φ)   (3)

Note we have used the notation X_{1:N,1:T} to denote the entire set of design matrix entries over trials 1 to N and over all timepoints 1 to T within those trials. While these design matrix entries could vary as a function of time within each trial, they do not in either of the datasets we analyse here (i.e., each trial has a single value for each column of the matrix X that does not change until the next trial). In the analysis below, we expand upon these terms, with modelling decisions explained and justified in turn. We omit the model evidence term (i.e., the denominator of the posterior) as it is methodologically sufficient to compute this posterior up to proportionality.

| Data likelihood
The likelihood of each observation is conditionally independent of every other observation given the current value of the latent state variable. Thus, we can write the full likelihood of all observations over N trials each of T timepoints as a product over time and trials:

P(Y_{1:N,1:T} | X_{1:N,1:T}, Z_{1:N,1:T}, W_{1:K}, Σ_{1:K}) = ∏_{n=1}^{N} ∏_{t=1}^{T} N(Y_{n,t}; X_{n,t} W_{Z_{n,t}}, Σ_{Z_{n,t}})

F I G U R E 3 Bayesian model outline and left-to-right sequential state dynamics. Left panel: the full model outline in Bayesian plate notation. For each of N trials of length T, we have data observations Y_{n,t} conditioned upon the corresponding design matrix entries X_{n,t}. These data observations Y_{n,t} are also conditioned upon a latent Markov variable Z_{n,t}, which models the state sequence unique to each trial, and upon the associated state parameters W_k and Σ_k, which are modelled separately for each of the K total states. The latent state variables are themselves conditioned upon the transition matrix Φ, while the activation patterns in each state are conditioned upon an automatic relevance determination prior parameter Λ. Right panel: we depart from conventional HMM modelling, which freely permits any state to transition to any other state as in the diagram on the left, by instead imposing a left-to-right sequential HMM. As shown on the right, this restricts the permissible state transitions to a consecutive sequence, such that state 1 can either persist or transition to state 2 at each timestep; similarly, if state 2 is active it can either persist or transition to state 3 at each timestep. This structure imposes more aggressive regularisation to overcome the overfitting issues often associated with supervised HMM models where each individual observation is modelled by a GLM, as in Equation (2).
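This state-conditional likelihood can be sketched numerically. The following is a minimal numpy illustration of our own, assuming the state sequence Z is known (in the full model it is inferred); function names are illustrative:

```python
import numpy as np

def state_loglik(Y, X, W, Sigma):
    """Gaussian log-likelihood log N(Y_t; X_t W, Sigma) at each timepoint,
    for one state's parameters. Y: [T x P], X: [T x Q], W: [Q x P]."""
    P = Y.shape[1]
    resid = Y - X @ W
    Sinv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    mahal = np.einsum('tp,pq,tq->t', resid, Sinv, resid)
    return -0.5 * (P * np.log(2 * np.pi) + logdet + mahal)

def full_loglik(Y, X, Z, Ws, Sigmas):
    """Total log-likelihood over N trials and T timepoints, with each
    observation conditionally independent given its state label in Z.
    Y: [N x T x P], X: [N x T x Q], Z: [N x T] integer state labels."""
    total = 0.0
    for n in range(Y.shape[0]):
        for k in range(len(Ws)):
            mask = Z[n] == k
            if mask.any():
                total += state_loglik(Y[n, mask], X[n, mask],
                                      Ws[k], Sigmas[k]).sum()
    return total
```

The double product over trials and timepoints in the likelihood becomes a simple sum of per-observation Gaussian log-densities, each evaluated under its assigned state's parameters.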

| Left-to-right latent state prior
We assume that P(Z_{1:N,1:T}, W_{1:K}, Σ_{1:K}, Λ_{1:K}, Φ) factorises over parameters, that is:

P(Z_{1:N,1:T}, W_{1:K}, Σ_{1:K}, Λ_{1:K}, Φ) = P(Z_{1:N,1:T} | Φ) P(Φ) ∏_{k=1}^{K} P(W_k | Λ_k) P(Λ_k) P(Σ_k)

and model the latent state variable Z_{n,t} using a Markovian prior:

P(Z_{n,1:T} | Φ) = P(Z_{n,1}) ∏_{t=2}^{T} P(Z_{n,t} | Z_{n,t−1}, Φ)

However, we make one important departure from the far more ubiquitous unsupervised HMM. In the cases where HMMs are used for unsupervised analyses of M/EEG data, for example, to find resting state networks (Higgins et al., 2021; Vidaurre et al., 2018), they are typically allowed to transition freely between states. In supervised data analysis, this can lead to severe overfitting (Ghahramani, 2001), so we instead constrain the model to follow a common trajectory over each trial: we assume that the state order is a fixed sequence, where only the timing of state transitions is allowed to vary. Every trial begins in state 1 and subsequently progresses consecutively through the states, with state transition times freely inferred on a trial-specific basis (see Figure 3). This has the advantage of enforcing an interpretable sequence of activation over trials, while reducing the number of free parameters governing state transitions.
Where unconstrained HMMs must consider a full [K × K] transition matrix, the left-to-right constraint means we need only model a [K × 1] vector Φ, the kth entry of which (denoted p_k) captures the probability of state k transitioning to state k + 1. This structure is enforced by the following prior over the state transitions:

P(Z_{n,t} = j | Z_{n,t−1} = k) = p_k if j = k + 1; (1 − p_k) if j = k; 0 otherwise

We then set the following conjugate prior over the transition matrix entries (a Dirichlet distribution, as the multivariate extension of a Beta distribution, is the conjugate prior to HMM transition probability matrices in general; however, as our left-to-right constraint limits the dimensionality of each transition matrix row to 2, this reduces to a standard Beta distribution):

P(p_k) = Beta(p_k | α, β)   (8)

We set the hyperparameters α = 1 and β = 1 (corresponding to a uniform distribution).
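The left-to-right structure can be illustrated with a small helper that expands the transition probability vector into the full transition matrix it implies. This is a sketch of our own; treating the final state as absorbing is our assumption, since its outgoing transition is not defined by the sequence:

```python
import numpy as np

def left_to_right_transmat(p):
    """Expand the transition probabilities p (one per consecutive state
    pair) into the full [K x K] transition matrix implied by the
    left-to-right constraint: state k persists with probability 1 - p[k]
    or advances to state k + 1 with probability p[k]. The final state is
    treated here as absorbing (an assumption)."""
    K = len(p) + 1
    A = np.zeros((K, K))
    for k in range(K - 1):
        A[k, k] = 1.0 - p[k]
        A[k, k + 1] = p[k]
    A[K - 1, K - 1] = 1.0
    return A

A = left_to_right_transmat([0.2, 0.5])   # K = 3 states
```

The resulting matrix is upper bidiagonal: all backward transitions and all skips have probability zero, which is exactly the regularisation the text describes.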

| Observation model parameter prior
For the observation model parameters W_{1:K} and Σ_{1:K}, we apply conjugate priors: a Wishart distribution for each inverse covariance matrix (equivalently, an inverse Wishart distribution for the covariance matrix) and an automatic relevance determination (ARD) prior, as specified below, for the stimulus activation patterns. The use of an ARD prior prunes away inferred stimulus activation patterns on sensors that are less consistent over trials, in a manner that performs favourably in neuroimaging (Woolrich et al., 2009; Yamashita, Sato, Yoshioka, Tong, & Kamitani, 2008). Denoting by w_{k,i,j} the (i, j)th entry of the matrix W_k, and similarly by λ_{k,i,j} the (i, j)th entry of the matrix Λ_k, this is implemented as follows:

P(Σ_k^{−1}) = Wishart(Σ_k^{−1} | A, υ)
P(w_{k,i,j} | λ_{k,i,j}) = N(w_{k,i,j}; 0, λ_{k,i,j}^{−1})
P(λ_{k,i,j}) = Gamma(λ_{k,i,j} | a, b)

We then set the values of the hyperparameters to A = (1/P) I_P and υ = P, where I_P denotes the identity matrix of dimension P. A result of this is that the expected value of the inverse covariance matrix Σ_k^{−1} under this prior is the identity matrix I_P, corresponding to a prior specifying normalised and uncorrelated data (itself a justified assumption given the model is fit in principal component space, as outlined below). We furthermore set the hyperparameter values a = 1 and b = 1. This completes the Bayesian hierarchy.
By inferring a full covariance matrix and matrix of regression weights for each state, we have a potentially quite large parameter space that grows as a function of the number of states K. While the above priors ensure that the model parameters are always well-defined, their accuracy may nonetheless suffer when the number of states is large. Our weakly informative priors can be interpreted as adding a small number of virtual data points to the analysis that encode weakly held assumptions about the data; while this regularises the problem and ensures parameters are well-defined, if there are very few data points assigned to a state then these assumptions may still become overly dominant in the parameter estimation. This is one of the reasons we use cross-validation to optimise the hyperparameter K. In other datasets where the number of datapoints is much more limited than those characterised here, our toolbox also supports further limiting the parameter space by restricting the form of Σ_k to be diagonal and/or uniform over different states.

| Variational Bayesian inference
As a hierarchical extension of a linear model with conjugate priors, this model is amenable to classic variational Bayesian inference methods.
These methods are efficient and scale well to large datasets (Beal, 2003); in the context of the specified model, they are an extension of the popular expectation maximisation algorithm to full posterior inference (i.e., rather than point estimates) for each of the specified model parameters, which is generally more robust than point estimate methods (Johnson, 2007).
Variational Bayesian methods apply the mean-field approximation, whereby the full posterior specified in Equation (3) is assumed to be approximated by the following factorised form:

Q(Z_{1:N,1:T}, W_{1:K}, Σ_{1:K}, Λ_{1:K}, Φ) = Q(Z_{1:N,1:T}) Q(Φ) ∏_{k=1}^{K} Q(W_k) Q(Σ_k) Q(Λ_k)

This factorisation assumption approximates the full posterior as a product of independent factors. The advantage of such an approach is that it allows the inference of the approximate posterior (which is otherwise analytically intractable) to be framed as an optimisation problem, where minimisation of the free energy equates to minimising the Kullback-Leibler divergence between the true posterior and the approximate factorised posterior. Crucially, this optimisation problem naturally breaks down into a series of analytically tractable sub-routines by virtue of the factorisation applied (Beal, 2003). This procedure requires an initialisation step; we initialise the model using state timecourse parameters that are uniform over trials, where each state is visited for an interval T/K seconds long (T being the length of the trial and K being the number of states). This corresponds to an initialisation assumption of no variation in the state timecourses or state durations over trials. We then proceed with the variational inference procedure as outlined above. For further technical details, including the variational update equations, see Higgins (2019); for evidence of variational inference converging to known ground truth in simulations, see Figure S1.
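The uniform initialisation described above can be sketched as follows; a minimal numpy example of our own construction, working in timepoints rather than seconds:

```python
import numpy as np

def uniform_state_init(T, K):
    """Initial state probabilities used to start variational inference:
    each of the K states occupies a contiguous block of roughly T / K
    timepoints, identically on every trial (i.e., the initialisation
    assumes no inter-trial variation in state timing)."""
    edges = np.linspace(0, T, K + 1).astype(int)
    gamma = np.zeros((T, K))
    for k in range(K):
        gamma[edges[k]:edges[k + 1], k] = 1.0
    return gamma

gamma0 = uniform_state_init(100, 4)   # 100 timepoints, 4 states
```

Each row of the returned array is a one-hot state assignment, so the initial "state timecourse" is a hard, evenly spaced segmentation that the variational updates then refine on a trial-by-trial basis.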

| Inverting the encoding model
We can now use Bayes' rule to invert the fitted encoding model and obtain predictions for unseen test data:

P(X_test | Y_test, θ̃, Z̃_test) ∝ P(Y_test | X_test, θ̃, Z̃_test) P(X_test)   (11)

where Z̃_test and θ̃ are parameter estimates defined fully below. Bayes' rule used in this manner allows us to conveniently map between an encoding model and its equivalent decoding model, allowing us to make predictions of the associated stimuli for any unseen test data.
Note that P(Y_test | X_test, θ̃, Z̃_test) refers to the GLM observation model introduced above (Equation (2)). When fitting this model to unseen test data, we substitute the point estimates θ̃ given by the expected values of the posterior distributions inferred on the training dataset (Equation (10)).
We similarly need to estimate the expected values of the latent state variables Z̃_test for each timepoint and trial in the test data. However, as we are using a supervised HMM, these variables cannot be computed by the same inferential step used in training, as this would require prior knowledge of the held-out labels we are trying to predict (Vidaurre et al., 2019). Furthermore, existing Bayesian approaches that compute a joint posterior over both design matrix entries and latent state timecourses and then marginalise out over the latent state variables (Beal, 2003) are not tractable without imposing additional assumptions that seemed unsuitable to us here (see section 2 of Supporting Information). Emphasising that this problem arises only in the computation of held-out accuracy metrics and not in the model-fitting pipeline (for which state timecourses are estimated using the variational Bayesian methods outlined in Section 2.3.4), we instead estimate the cross-validated state timecourse variables using a conservative post hoc procedure that is entirely independent of the true label values (shown schematically in Figure S3). To simplify notation, let us define Γ_t as the [1 × K] vector of posterior state probabilities at time t, where the kth entry is γ_{t,k} = P(Z_t = k | Y_test, θ̃), such that we wish to obtain an estimate Γ̃_t for the held-out trials. Taking the training data Y_train and the inferred posterior state timecourse probabilities Γ_t, we train a linear regression model at each timestep within a trial to estimate the inferred state timecourses from the training data itself:

B_t = argmin_B || Γ_t^train − Y_t^train B ||²

We then use these linear weights to estimate the test set timecourses for all states from the data itself at each timepoint within a trial:

Γ̃_t^test = σ(Y_t^test B_t)

where σ denotes the softmax function that ensures these are probability estimates that sum to one.
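A sketch of this post hoc procedure in numpy, assuming trial-aligned arrays; the function name and the added intercept column are our own illustrative choices, not the toolbox implementation:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def estimate_test_timecourses(Y_train, gamma_train, Y_test):
    """At each timepoint within a trial, regress the inferred training-set
    state probabilities on the training data, then apply the learned
    weights (plus a softmax) to the held-out data. No test labels are
    used, by design. Y_train: [Ntr x T x P], gamma_train: [Ntr x T x K],
    Y_test: [Nte x T x P]; returns [Nte x T x K] state probabilities."""
    Nte, T, _ = Y_test.shape
    K = gamma_train.shape[2]
    gamma_test = np.zeros((Nte, T, K))
    for t in range(T):
        A = np.c_[Y_train[:, t, :], np.ones(Y_train.shape[0])]  # intercept
        B, *_ = np.linalg.lstsq(A, gamma_train[:, t, :], rcond=None)
        pred = np.c_[Y_test[:, t, :], np.ones(Nte)] @ B
        gamma_test[:, t, :] = softmax(pred, axis=1)
    return gamma_test
```

Because the regression targets are the training-set state probabilities rather than the labels, the estimated test timecourses remain entirely label-free, matching the conservatism the text emphasises.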
We discuss the implications of this step later in the article and expand upon alternative choices in section 2 of Supporting Information (see also Figure S4). We also emphasise that this step is only used when computing cross-validated accuracy metrics, and as such it does not feature in the initial model fit (Figure 2a), the encoding model analysis (Figure 2b), or the analysis of pattern timing variation ( Figure 2c).
The other relevant term for our model inversion is the prior over the structure of the design matrix, P(X). We consider two cases: the first where the design matrix consists of mutually exclusive classes (i.e., a classification paradigm), and a second where the design matrix may contain continuous valued regressors (i.e., a regression paradigm).
We refer to these as SpatioTemporally Resolved MVPA (STRM)-Classification and STRM-Regression, respectively.

| STRM-classification
When the design matrix consists of a total of Q mutually exclusive classes, the form of P(X) is categorical. Assuming the vectors X_t follow a one-hot encoding scheme, and writing the prior probability of each class as a [Q × 1] vector C, we have the following:

P(X_t = e_i) = c_i

where e_i is the one-hot vector obtained by taking the ith row of the identity matrix of dimension Q. Assuming all classes are equally probable, all entries of C are equal to 1/Q. This leads us to an analytical solution to Equation (11) (see Appendix A for details):

P(X_t = e_i | Y_t, θ̃, Γ̃_t) = σ(L)_i, with L_i = Σ_{k=1}^{K} γ̃_{t,k} log N(Y_t; e_i W_k, Σ_k)

where L is a [Q × 1] vector of unnormalised class log-likelihoods and σ(L) is the softmax function, which normalises these into a probability distribution over classes. Note that γ̃_{t,k} is the estimate for the probability of each state being active at time t (i.e., the terms discussed above that are estimated by a regression model).
This solution is equivalent to classification by Linear Discriminant Analysis (LDA); so, with the inferred latent state dynamics γ̃_{t,k}, we now have a dynamic form of LDA. Note that in different applications, users may also choose to model the covariance matrix Σ as strictly diagonal, in which case this model is equivalent to a dynamic form of Gaussian Naive Bayes classification.
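The resulting dynamic-LDA prediction rule can be sketched as follows, assuming a uniform class prior and ignoring the intercept regressor for simplicity. This is our illustrative reading of the softmax-over-log-likelihoods solution, not the authors' code:

```python
import numpy as np

def strm_classify(Y_trial, gamma_trial, Ws, Sigmas):
    """Dynamic-LDA class posterior for a single trial under a uniform
    class prior: per-class Gaussian log-likelihoods, weighted at each
    timepoint by the estimated state probabilities, are summed over the
    trial and passed through a softmax. Y_trial: [T x P],
    gamma_trial: [T x K], Ws[k]: [Q x P], Sigmas[k]: [P x P]."""
    Q = Ws[0].shape[0]
    L = np.zeros(Q)
    for i in range(Q):
        x = np.eye(Q)[i]                        # one-hot class vector e_i
        for k, (W, Sigma) in enumerate(zip(Ws, Sigmas)):
            resid = Y_trial - x @ W             # deviation from class mean
            Sinv = np.linalg.inv(Sigma)
            _, logdet = np.linalg.slogdet(Sigma)
            ll = -0.5 * (logdet + np.einsum('tp,pq,tq->t', resid, Sinv, resid))
            L[i] += (gamma_trial[:, k] * ll).sum()
    L -= L.max()                                # numerical stability
    return np.exp(L) / np.exp(L).sum()
```

With a single state this reduces to ordinary LDA; the state probabilities γ̃ simply reweight which state's mean and covariance judge each timepoint.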

| STRM-Regression
Similarly, given a model that relates a set of continuous valued regressors X_t to observed data Y_t, we can make new predictions given some assumption about the overall distribution of the regressors; we will assume they are standardised and uncorrelated, such that:

P(X_t) = N(X_t; 0, I_Q)

where I_Q is the identity matrix of dimension Q. As a conjugate prior, this ensures for any new observation Y_test that the posterior of Equation (11) has a Gaussian distribution (see Appendix B):

P(X_t | Y_t, θ̃, Γ̃_t) = N(X_t; μ_t, S_t)
S_t = (I_Q + Σ_{k=1}^{K} γ̃_{t,k} W_k Σ_k^{−1} W_k^T)^{−1}
μ_t = S_t Σ_{k=1}^{K} γ̃_{t,k} W_k Σ_k^{−1} Y_t^T

In the absence of the latent state variable γ̃_{t,k}, this is a case of the linear Gaussian systems (LGS) model (Murphy, 2012; Roweis & Ghahramani, 1999); thus, with the inclusion of the latent state variable, we have a dynamic LGS model.
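A sketch of the corresponding posterior computation under the standard-normal prior on the regressors; the state-weighted precision update shown here is our reading of the dynamic LGS inversion, not the authors' exact implementation:

```python
import numpy as np

def strm_regress(Y_trial, gamma_trial, Ws, Sigmas):
    """Gaussian posterior over continuous regressors X at each timepoint
    of a trial, combining the N(0, I_Q) prior with the state-weighted
    Gaussian likelihood: a linear Gaussian system inversion. Returns
    posterior means [T x Q] and covariances [T x Q x Q].
    Y_trial: [T x P], gamma_trial: [T x K], Ws[k]: [Q x P]."""
    T = Y_trial.shape[0]
    Q = Ws[0].shape[0]
    means = np.zeros((T, Q))
    covs = np.zeros((T, Q, Q))
    for t in range(T):
        prec = np.eye(Q)                         # prior precision I_Q
        lin = np.zeros(Q)
        for k, (W, Sigma) in enumerate(zip(Ws, Sigmas)):
            Sinv = np.linalg.inv(Sigma)
            prec += gamma_trial[t, k] * W @ Sinv @ W.T
            lin += gamma_trial[t, k] * W @ Sinv @ Y_trial[t]
        covs[t] = np.linalg.inv(prec)
        means[t] = covs[t] @ lin
    return means, covs
```

Note how the identity prior precision shrinks the posterior mean towards zero, with the shrinkage vanishing as the state-weighted evidence (the W Σ^{−1} W^T term) grows.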

| MEG visual data analysis
We test the model on two datasets (for additional verification on synthetic data simulated from the generative model, see Supporting Information). The first dataset we analyse comprises a visual stimulus decoding task previously published as part of a larger study (Liu et al.). The source-reconstructed data were parcellated into regions of interest using the parcellation of Colclough et al. (2016), which has proven effective in a number of applications to MEG data (Colclough et al., 2017; Higgins et al., 2021; Vidaurre et al., 2018).
Source leakage was then corrected for by orthogonalization as outlined in Colclough, Brookes, Smith, and Woolrich (2015). Ultimately, the reconstructed source data accounted for only 46% of the total variance in the original sensor data; thus, these data must be interpreted with caution, as they reflect only a partial view of the total information available to the classifiers, but they can nonetheless be informative as to the neural origins of this activity.

| STRM-Classification model
As the stimulus set was categorical, we established a design matrix with nine regressors; these corresponded to one regressor for each distinct visual stimulus, along with an intercept term. Thus, the inferred model coefficients for each latent state corresponded to a generic activation pattern over all stimuli for that state (i.e., the intercept term), and an effect specific to each visual stimulus for that state. We initially fixed the parameter for the number of states to K = 8 and sought to explore the activation parameters and state timecourses for this fixed number of states. Only later in the pipeline, when determining classification accuracy, do we seek to optimise this parameter. Where the parameter optimisation procedure could not be used, we demonstrate that our results and conclusions are not sensitive to the specific choice of this hyperparameter (see section 3 in Supporting Information and Figure S5).

| Characterising state timecourse variability
The state timecourses inferred from fitting our STRM model characterise the timing at which specific activity patterns emerge on individual trials. We wished to explore what variables might influence these timings, and so we fit a post hoc multiple regression model asking whether specific behavioural and physiological variables allowed us to predict the timing of specific states on individual trials (as illustrated schematically in Figure 2c).
For each subject, given a STRM fit with K states, we fit (K − 1) multiple regression models, with the kth model predicting the timing of the transition from the kth to the (k + 1)th state (i.e., using the state transition time as the dependent variable). In each model we used the same four independent variables; the first two regressors were the inter-stimulus interval (ISI) and participant reaction time (see Figure 6), capturing variation in the overall timed structure of the trial. Note that the inter-stimulus interval was randomly generated for each trial from a uniform distribution between 0.5 and 2 s, whereas the reaction time was determined by the participant and thus reflects a behavioural readout.
The remaining two independent regressors were designed to determine whether spontaneous changes in baseline electrophysiological patterns immediately prior to image presentation influenced the speed at which different patterns emerged. For this, we computed the power spectral density (PSD) in each ROI of the source space data over the 200 ms preceding stimulus presentation. These data were high dimensional, so we applied a non-negative matrix factorisation (NNMF) across the data from all subjects to extract two primary modes of spatio-spectral variation. Specifically, we arranged the baseline PSD estimates for each subject across N ROIs in F frequency bands and M trials into a single matrix of dimension [M × NF]. We then applied the NNMF decomposition to the row-wise concatenation of all subjects' PSD matrices as in Vidaurre et al. (2018), obtaining two main modes of spatio-spectral variation that corresponded to a broadband mode and a visual alpha mode (see Figure 6). We used the expression strengths of each of these two modes on each trial as the two independent variables in the multiple regression model, thus asking to what degree baseline broadband power and baseline visual alpha power influenced the timing of subsequent visual processing states.
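A minimal NNMF sketch using Lee-Seung multiplicative updates in numpy; the rank, iteration count and update rule are illustrative choices of our own, and the published analysis may have used a different NNMF implementation:

```python
import numpy as np

def nnmf(V, r, n_iter=500, seed=0):
    """Rank-r non-negative matrix factorisation V ~ W @ H via Lee-Seung
    multiplicative updates (Frobenius loss). For the analysis above, V
    would hold the [M x NF] trial-wise baseline PSD features; rows of H
    are the spatio-spectral modes and columns of W their per-trial
    expression strengths."""
    rng = np.random.default_rng(seed)
    M, D = V.shape
    W = rng.random((M, r)) + 1e-3
    H = rng.random((r, D)) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)   # update modes
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)   # update expressions
    return W, H
```

With r = 2, the columns of W would supply the two per-trial mode expression regressors described above.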
Finally, to eliminate collinearity, we decorrelated all four regressors using symmetric orthogonalization. These multiple regression models were fit at the subject level and comprised a total of 4(K − 1) multiple comparisons. We evaluated the significance of effects at the group level through two-tailed t-tests on the distribution of model coefficients over subjects, using Bonferroni correction of p values.
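The group-level test can be sketched as a one-sample t-test across subjects with Bonferroni correction; a minimal illustration of our own (the published analysis may differ in detail):

```python
import numpy as np
from scipy import stats

def group_ttest_bonferroni(coefs, n_tests):
    """Two-tailed one-sample t-test of per-subject regression
    coefficients against zero, with Bonferroni correction over n_tests
    comparisons. coefs: [n_subjects] array of coefficients for one
    regressor/state pair. Returns the t statistic and corrected p."""
    n = len(coefs)
    t = coefs.mean() / (coefs.std(ddof=1) / np.sqrt(n))
    p = 2.0 * stats.t.sf(abs(t), df=n - 1)   # two-tailed p value
    return t, min(1.0, p * n_tests)
```

In the analysis above, n_tests would be 4(K − 1), one per regressor and state transition.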

| STRM-Classification predictive accuracy
To assess the performance of the STRM-Classification model, we used 10-fold cross-validation. This entailed a 10-fold partitioning of each subject's data, followed by an iterative procedure of holding out one fold for testing while training the model on the remaining data for that subject, then testing the classification performance using the procedure outlined in Equation (11) on the held-out partition. This was repeated iteratively until all trials had been classified; we defined classification accuracy as the proportion of trials upon which the correct label was predicted by the classifier.
Whereas we previously held the parameter K for the number of states fixed to a single value, the classification accuracy provides a clear metric by which this parameter could be optimised. Thus, the above cross-validation procedure was run using a total of 10 different values of K ranging from 4 to 22 in steps of 2. These can either be evaluated separately or optimised by nested cross-validation, with the data of the different subjects forming the outer cross-validation loop.
This latter procedure entails holding out one subject, determining the value of the parameter K that maximises the classification accuracy for the remaining subjects, and selecting that value when determining the accuracy for the held-out subject.
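This leave-one-subject-out selection of K can be sketched as follows; the accuracy table is filled with stand-in numbers rather than real cross-validated accuracies, and the subject count is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
K_values = np.arange(4, 23, 2)        # K = 4, 6, ..., 22 (10 values)
n_subjects = 12                       # hypothetical subject count

# acc[s, i]: cross-validated accuracy of subject s at K_values[i]
# (stand-in numbers here; in practice from the 10-fold procedure).
acc = 0.3 + 0.1 * rng.random((n_subjects, K_values.size))

selected_acc = np.empty(n_subjects)
for s in range(n_subjects):
    others = np.delete(np.arange(n_subjects), s)   # outer loop: hold out subject s
    best = acc[others].mean(axis=0).argmax()       # K maximising the others' accuracy
    selected_acc[s] = acc[s, best]                 # evaluate the held-out subject
```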
To assess statistical significance, we used two tests. To test whether the STRM-Classifier accuracy was significantly greater than the equivalent timepoint-by-timepoint classifier accuracy, we used non-parametric cluster permutation tests with a t threshold of 1 (Nichols & Holmes, 2003). To test whether the overall accuracy (averaged over all timepoints) exceeded that of other classifiers, we first applied a group-level ANOVA followed by pairwise t-tests.
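A hand-rolled sketch of a one-sample (sign-flip) cluster permutation test over time, in the spirit of Nichols and Holmes (2003); this is an illustrative implementation with synthetic accuracy differences, not the authors' code (packages such as MNE-Python provide production versions).

```python
import numpy as np
from scipy import stats

def cluster_perm_test(diff, t_thresh=1.0, n_perm=500, seed=0):
    """Sign-flip cluster permutation test on paired accuracy differences
    (n_subjects x n_times); returns one p value per suprathreshold cluster."""
    rng = np.random.default_rng(seed)

    def cluster_masses(x):
        t = stats.ttest_1samp(x, 0, axis=0).statistic
        masses, run = [], 0.0
        for supra, tv in zip(t > t_thresh, t):   # contiguous runs of t > threshold
            if supra:
                run += tv
            elif run > 0:
                masses.append(run)
                run = 0.0
        if run > 0:
            masses.append(run)
        return masses

    observed = cluster_masses(diff)
    null_max = np.zeros(n_perm)
    for p in range(n_perm):
        flips = rng.choice([-1.0, 1.0], size=(diff.shape[0], 1))
        m = cluster_masses(diff * flips)         # max cluster mass under sign flips
        null_max[p] = max(m) if m else 0.0
    return [(null_max >= m).mean() for m in observed]

# Synthetic example: 10 subjects, 30 timepoints, a real effect at t = 10..19.
rng = np.random.default_rng(1)
diff = rng.standard_normal((10, 30)) * 0.5
diff[:, 10:20] += 1.0
pvals = cluster_perm_test(diff)
```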

| EEG data analysis
To demonstrate the general applicability of this method, we also applied it to a set of EEG data collected during a decision making and reward learning task. This dataset was selected as it used stimuli (i.e., reward outcomes) that varied continuously rather than categorically, corresponding to the STRM-Regression model. Furthermore, the more complex task design involved additional variables that we believed may modulate neural processes and thus state timecourse patterns.

| Task outline
A total of 30 participants completed the task while undergoing EEG recording. Twenty-three were of sufficient data quality to be included in the analysis (the remaining participants were excluded either for not understanding the task or for excessive noise). The task consisted of navigating between different foraging patches in an effort to maximise the total reward accumulated over the course of the recording session. Specifically, there were two types of decisions participants had to make. First, in the "site selection" phase, they chose between one of two patches; they then entered their chosen patch (after a variable waiting period) and accrued reward (the "reward accrual" phase). During the reward accrual phase, participants were shown their current reward level, which changed every second (decaying from different starting points and at different decay rates). Participants were continually prompted: they could either do nothing and continue to accrue reward at a depleting rate, or press the space bar to collect the reward accumulated during this patch visit. They would then again be asked to choose between the two patches. Decay rates and starting points of both patches were learnable but changed periodically throughout the experiment.
For the purposes of our analysis, the only task event analysed is the period where participants receive the overall reward (the "Reward receipt" phase). At that moment, participants had to estimate the average reward rate they had achieved in the current trial in order to decide whether to leave earlier or later in the next trial, or indeed to pick the other patch next time. The average reward rate is computed by combining the accumulated reward at a patch and the time invested to receive it.
All participants signed written consent in advance; ethical approval for the experiment was obtained from the Medical Sciences ethics committee. Where a component was identified as capturing eye movements or blinks, we regressed out that component. In the next step, we removed trials with excessive muscle artefacts by taking eight electrodes ("PO7," "PO8," "POZ," "PO4," "PO3," "O1," "OZ," "O2") near the neck, which were most affected by muscle movements, z-scoring them, and removing trials at a conservative z-score cutoff of 20. After muscle filtering, a second visual inspection was applied to the data in case any very noisy trials remained. We downsampled the data to 100 samples per second using a polyphase filter with a 25 Hz cutoff. Additionally, to further remove ocular artefacts, we regressed the EOG channel signals out of each EEG electrode signal. We applied dimensionality reduction as previously described for the MEG data, using PCA with eigenvalue normalisation and keeping the top 20 components (out of 59 channels), which explained a total of 93.6% of the variance.
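The EOG regression and PCA-with-eigenvalue-normalisation steps can be sketched as follows, on synthetic data with the channel counts from the text; interpreting "eigenvalue normalisation" as whitening of the retained principal component scores is our assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
n_times, n_eeg, n_eog = 3000, 59, 2
eog = rng.standard_normal((n_times, n_eog))
# Synthetic EEG contaminated by linearly mixed EOG activity.
eeg = rng.standard_normal((n_times, n_eeg)) + 0.5 * (eog @ rng.standard_normal((n_eog, n_eeg)))

# 1. Regress the EOG signals out of every EEG channel (least squares).
beta, *_ = np.linalg.lstsq(eog, eeg, rcond=None)
eeg_clean = eeg - eog @ beta

# 2. PCA keeping the top 20 components, with eigenvalue normalisation
#    (each component score divided by the square root of its eigenvalue).
X = eeg_clean - eeg_clean.mean(axis=0)
cov = X.T @ X / (n_times - 1)
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1][:20]
components = (X @ evecs[:, order]) / np.sqrt(evals[order])

explained = evals[order].sum() / evals.sum()   # variance retained
```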

| STRM-Regression model
Each trial consisted of a presentation on screen of the reward value that had just been accumulated. For our STRM-Regression model, we used a design matrix, X, with two continuous regressors: the first a mean activation term modelling the mean change in activity for a state over all trials, and the second the value of reward presented on screen in that trial. This second regressor was normalised over trials.
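The resulting design matrix is simple to construct; a minimal sketch with hypothetical reward values:

```python
import numpy as np

rng = np.random.default_rng(7)
n_trials = 200
reward = rng.random(n_trials) * 10       # hypothetical on-screen reward values

# Two-column design matrix: a constant (the state's mean activation term)
# and the reward value, normalised (z-scored) over trials.
reward_z = (reward - reward.mean()) / reward.std()
X = np.column_stack([np.ones(n_trials), reward_z])
```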
We fit the STRM-Regression model to the principal component space derived from the sensor level data, as in Grootswagers et al. (2017). This was done independently for each session of each subject (note that for any group statistics we first averaged model parameters for a given subject over both of their recording sessions, then ran group statistics on these subject-average values). As outlined in Section 2.5.3, we used a total of K = 8 states (as previously, we initially held this parameter fixed while characterising the model output, and later show how it can be optimised for decoding accuracy).
Where this parameter optimisation procedure could not be used, we demonstrate that our results and conclusions are not sensitive to the specific choice of hyperparameter (see section 3 of the Supporting Information and Figure S6).

| Characterising spatial activity maps
Having determined the state timecourses from initially fitting STRM to the principal component space derived from the sensor level data, we then projected the state timecourse information back onto the original sensor data to obtain sensor activity maps per regressor per state. Maps in Figure 8 plot the group average value of model coefficients for each state; that is, the mean activity pattern associated with each state and the value evoked effect within each state.

| Characterising state timecourse variability
In exactly the same way as conducted for the MEG data, having determined the state timecourses from fitting our STRM-Regression model, we then asked whether the timings of these stimulus processing states were significantly modulated by other variables within the task; specifically, variables that the participant is required to track in order to complete the task optimally.
For each subject, we fit (K − 1) multiple regression models, with the kth model predicting the timing of the transition from the kth to the (k + 1)th state (i.e., using the state transition time as the dependent variable). In each model we used three independent variables.
These consisted of the value of the stimulus itself; the total time invested at the site (i.e., how long the participant had spent in order to accumulate the value shown in the stimulus); and the recent reward history (i.e., what the participant might expect to gain from leaving the site in search of another). To eliminate collinearity, these variables were decorrelated by symmetric orthogonalization and normalised (Colclough et al., 2015). After fitting these models to the data for each subject, we conducted a group analysis testing whether any of the regressor coefficients were significantly different from zero. This involved 3(K − 1) multiple comparisons; p values were evaluated for significance after Bonferroni correction.
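The per-state regressions and the Bonferroni-corrected group test can be sketched as follows; the regressors and transition times are random stand-ins, and the counts (trials, subjects) are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
K, n_trials, n_reg, n_subjects = 8, 120, 3, 23

coefs = np.empty((n_subjects, K - 1, n_reg))
for s in range(n_subjects):
    # Hypothetical per-trial regressors (already orthogonalised/normalised).
    Z = rng.standard_normal((n_trials, n_reg))
    design = np.column_stack([np.ones(n_trials), Z])
    for k in range(K - 1):
        # Stand-in for the inferred k -> k+1 transition time on each trial.
        t_k = rng.standard_normal(n_trials)
        b, *_ = np.linalg.lstsq(design, t_k, rcond=None)
        coefs[s, k] = b[1:]                       # drop the intercept

# Group level: two-tailed t-test per (state, regressor) pair,
# Bonferroni corrected over the 3 * (K - 1) = 21 comparisons.
tvals, pvals = stats.ttest_1samp(coefs, 0, axis=0)
significant = pvals < 0.05 / (n_reg * (K - 1))
```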

| STRM-Regression predictive accuracy
In the same manner as outlined for the MEG data, we then assessed the performance of the STRM-Regression model to predict continuous stimuli using 10-fold cross-validation and showed how the parameter K controlling the number of states could be optimised by cross-validation over subjects. We used as accuracy metric the Pearson correlation between predictions and the true values. To assess statistical significance, when comparing accuracy versus time of two classifiers we applied a cluster permutation test with t-statistic threshold of 1 (Nichols & Holmes, 2003); when comparing the overall accuracy (averaged over all timepoints) of a group of models we applied a group ANOVA.

| RESULTS
We present here the results obtained from fitting the same model to two separate datasets. The first is a set of MEG recordings of categorical visual stimulus presentations, the second a set of EEG recordings of participants completing a cognitive task.

| Decoding visual stream representations using MEG
Visual stimuli evoke a cascade of feedforward and feedback activity through the dorsal and ventral visual streams (Goodale & Milner, 1992; Hochstein & Ahissar, 2002). Existing methods to identify the spatiotemporal evolution of these representational dynamics have relied upon methods that fuse MEG and fMRI recordings (Cichy et al., 2014) or require invasive recordings (Goodale, 1993). We asked whether these results could be corroborated using MEG as the sole recording modality.

| Classification accuracy
An advantage of MVPA is that model fit can be straightforwardly assessed using the metric of classification accuracy. We therefore asked how STRM-Classification compared with conventional analyses. Figure 4 plots the classification accuracy obtained by STRM-Classification versus that of the equivalent conventional approach, namely timepoint-by-timepoint LDA classification. This model does lead to enhanced classification accuracy during later timepoints (where activity patterns might be expected to be less well aligned); however, this gain is relatively modest. It appears that although our model learns state timecourse information that we will show correlates with meaningful behaviour and physiological patterns, these do not result in dramatic gains to classifier performance. Performance is reasonably robust to the STRM model order; as shown in the second plot of Figure 4, performance is equivalent for K ¼ 10 or higher. This plot furthermore identifies a performance tuning curve, making this amenable to parameter optimisation and motivating the cross-validation approach we use to generate the upper and lower panels of Figure 4 (see Section 2).
There are two innovations to which the difference in classification accuracy shown in Figure 4 (top panel) could be ascribed: the forward modelling procedure and the assumptions it imposes (which in this case are equivalent to the assumptions of LDA); and/or the time-varying dynamics that our model permits. To test the effect of the forward modelling procedure, we compared the accuracies achieved by a range of different classification methods. First, we compared the classification accuracy obtained using generative classifiers (LDA and Naive Bayes) with those achieved by a range of discriminative machine learning classifiers (support vector machines, both linear and with radial basis function kernels; logistic regression in binomial and multinomial configurations; and K-nearest neighbour classifiers). All classifiers were implemented on a timepoint-by-timepoint basis. As shown in Figure 4, generative model-based classifiers offer significant improvements in classification accuracy over these methods. Thus, we conclude that, at least for this dataset, the assumptions imposed by our generative model accurately represent the data and afford greater classification accuracy as a result.
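A timepoint-by-timepoint comparison of this kind can be sketched as follows, using scikit-learn classifiers on synthetic trials with a planted class effect; the data sizes and effect size are illustrative, not those of the study.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
n_trials, n_times, n_feat = 120, 5, 8
y = rng.integers(0, 2, n_trials)
# Synthetic trials with a class-dependent pattern at every timepoint.
X = rng.standard_normal((n_trials, n_times, n_feat))
X += y[:, None, None] * 0.8

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "NaiveBayes": GaussianNB(),
    "SVM-linear": SVC(kernel="linear"),
    "SVM-rbf": SVC(kernel="rbf"),
    "LogReg": LogisticRegression(max_iter=1000),
}

# accuracy[name][t]: 10-fold cross-validated accuracy at timepoint t.
accuracy = {name: [cross_val_score(clf, X[:, t, :], y, cv=10).mean()
                   for t in range(n_times)]
            for name, clf in classifiers.items()}
```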

| STRM visual stream activation patterns
Given that STRM-Classification models were fit independently to each subject's data, we asked how consistent the inferred spatial patterns of activity were over subjects. Specifically, we asked whether ROIs emerged that were particularly informative of class differences, and whether these were the same across subjects. The interpretation of these source-reconstructed maps is subject to the usual caveats of the inverse problem, which is ill-posed without additional modelling assumptions (see Section 2.5.2); they can nonetheless be informative if these modelling assumptions turn out to be a good fit for the data. Nothing in the model expressly enforces common patterns of activation for a given state across subjects; these are unified only by their common position in the temporal sequence of states (see Figure 6).
FIGURE 4 Classification accuracy achieved by different MVPA methods. Top panel: accuracy (mean over subjects ± SE) versus time for the STRM-Classifier (with K optimised through cross-validation; see Section 2) versus timepoint-by-timepoint spatially resolved decoding, identifying marginal improvements in classification accuracy over later timepoints when temporal patterns are identified. Middle panel: mean accuracy over all timepoints between 0 and 0.5 s as a function of the number of states K (mean over subjects ± SE). This relationship is robust for values of the parameter K above a sufficient level; equivalent decoding accuracy is achieved when using K = 10 states or higher. Lower panel: the STRM-Classifier performs favourably when compared with discriminative classification methods. We compare the STRM-Classifier with eight other classification methods: three other spatially resolved classifiers (LDA with optimised sliding window length, see Section 2; LDA using the conventional timepoint-by-timepoint decoding approach; and Naive Bayes using the conventional timepoint-by-timepoint decoding approach) and five discriminative classifiers fit using timepoint-by-timepoint methods (linear SVMs; non-linear SVMs using a radial basis function (RBF) kernel; binomial and multinomial logistic regression (LR); and K-nearest neighbour (KNN) classifiers with K optimised through cross-validation). In general, generative encoding-model-based classifiers (STRM, LDA and Naive Bayes) outperform discriminative classifiers (SVM, LR and KNN), and STRM decoding outperforms equivalent methods used with the conventional timepoint-by-timepoint decoding approach; however, these gains are slightly surpassed by optimised sliding window methods. Asterisks denote accuracy significantly different from the STRM-Classifier at Bonferroni corrected levels; green asterisks indicate significantly higher accuracy (sliding window LDA), blue significantly lower (all other classifiers).

We found a strong relationship between response times and state transition times. This relationship was significant for all states from state 3 onwards, with an increasing effect size toward later states. ISIs, on the other hand, did not exhibit a significant relationship; there was an apparent trend whereby longer ISIs were associated with faster state sequences, but this did not survive multiple comparison correction.
Spontaneous changes in baseline power similarly modulate the subsequent propagation of visual processing states, in a manner that appears selective to specific stages in the visual hierarchy. High levels of broadband power are associated with earlier transitions out of state 1 into state 2, whereas high levels of visual alpha band power are associated selectively with later transitions out of state 4 into state 5, the state we associate with the emergence of information outside visual cortex. This is broadly consistent with proposed roles for alpha in governing top-down control of attention and gating of information as it moves downstream (Jensen & Mazaheri, 2010); the relationship between broadband activity and earlier visual states is perhaps more surprising, however such patterns have been shown to be strong modulators of neural activity in a wide variety of ways (Miller et al., 2014; Podvalny, Flounders, King, Holroyd, & He, 2019).

FIGURE 5 Resolving the successive stages of visual stimulus processing in space and time. Fitting the STRM-Classification model independently to each subject's MEG data recorded during visual stimulus presentation, using K = 8 states, allows us to investigate the stages of visual stimulus processing and the times on individual trials at which they emerge, subject to the usual interpretational caveats associated with the inverse problem and the loss of information content in our source reconstruction methods. Top panel: average timing of each state over a trial (mean ± SE over subjects), demonstrating the mean time after stimulus presentation at which each state emerges. Lower central panel: a raster plot of state timecourses inferred for a sample subject over 248 trials, demonstrating the variability in timings over successive trials within the common left-to-right HMM pattern progressing from state 1 to 8. Lower outer panels: the thresholded, group-mean F statistics per ROI, resulting from multiple subject-level ANOVAs; these display the amount of information contributed by each ROI to discriminating the different visual stimuli (see Section 2). Statistics are thresholded at the 75th percentile of all test statistics obtained.
These examples serve primarily as a model validation, demonstrating that the observed variation in activity pattern timings does not arise randomly but rather is reflective of underlying neural processes that are behaviourally and physiologically relevant. They furthermore suggest that activation timings in different stages of the visual cortex are not modulated uniformly, but rather that different variables may selectively affect propagation timings in different parts of the visual hierarchy.

| Decoding stages of cognitive processing in EEG
The ability to discern and analyse the timing of activity patterns emerging across different trials is potentially more salient in the case of non-sensory representations. Recently, decoding has emerged as a popular paradigm for analysing higher cognitive functions in complex tasks at high temporal resolution (Holdgraf et al., 2017; Kriegeskorte & Kievit, 2013); however, these methods still mostly assume these higher cognitive functions are perfectly aligned across trials. We thus asked whether the model we had proposed generalised to a very different EEG dataset, involving prediction of a continuous variable (value) over multiple trials in a complex foraging task, where contextual variables differed substantially across trials and could potentially determine activity pattern timings.

FIGURE 6 The timing of visual processing is modulated by behaviour and physiology. Panel a: inter-stimulus intervals do not significantly modulate state transition times. Panel b: longer participant reaction times are predictive of delayed transitions into states 3-8, with increasing effect size toward later states. Panel c: increases in baseline broadband power are associated with more rapid transitions into state 2, an early visual processing state. Panel d: increases in baseline alpha power over visual areas are associated with delayed transitions from state 4 into state 5. In all bar plots, asterisks denote significance at Bonferroni corrected levels (p = 2.1 × 10^-3).

| Accuracy of dynamic LGS decoding
We first asked whether STRM-Regression resulted in benefits to decoding accuracy, measured as the cross-validated Pearson correlation between test set model predictions and the true values. We found no significant differences in performance between STRM-Regression and the equivalent timepoint-by-timepoint method. We also asked how sensitive this behaviour was to the parameter controlling the number of states; as shown in Figure 7 (middle panel), it is very consistent across all values of this parameter tested (no significant differences were observed; one-way ANOVA, p = .98).
We then investigated what drove this performance and how it compared with similar methods. An equivalent encoding model trained using optimised sliding window techniques (i.e., sliding window LGS; see Section 2.6.6) achieved equivalent performance. There remained a small trend whereby the sliding window method performed marginally better than STRM-Regression, mirroring the relationship obtained for STRM-Classification, but these differences were not significant. Overall, they suggest that the model's sensitivity to trial-specific differences in pattern onset timing, which we show below correlate strongly with behavioural variables, is nonetheless not a major determinant of predictive accuracy. Finally, the question remains whether the overall forward model-based approach itself performs favourably compared with discriminative methods. To assess this, we compared model performance against that obtained by three decoding methods: linear regression and SVM regression using either linear or radial basis function kernels. Comparing all models on the basis of their correlation coefficients identified no significant group variation (one-way ANOVA, p = .19), despite an apparent trend for the decoding models to achieve lower correlations. Overall, these results suggest that the generative forward model approach to decoding performs no worse than more commonly used regression techniques, and that its use alongside more targeted time series methods (i.e., the STRM and sliding window models) may offer very slight performance improvements.

FIGURE 7 Predictive accuracy of STRM-Regression decoding. Top panel: plotting the Pearson correlation between model predictions and true regressor values over time, we see equivalent performance from both methods. The STRM-Regression output shown here was obtained by optimising the value of K (the number of states) by subject-level cross-validation (see Section 2). Middle panel: this performance is robust over a range of values of the parameter K controlling the number of states, with STRM-Regression displaying no significant difference from synchronous models for all values of K tested (ANOVA, p = .98); this plot does, however, identify a performance tuning curve that justifies optimisation through cross-validation. Lower panel: performance of STRM-Regression against a range of decoding models, both generative (LGS, in both synchronous and optimised sliding window modes) and discriminative (linear regression and support vector machine regression, using linear and radial basis function kernels). The Pearson correlation shows no significant difference between groups (ANOVA, p = .19). While not significant, the trend mirrors that obtained for STRM-Classification: STRM-Regression is broadly consistent in performance with its timepoint-by-timepoint LGS equivalent, while optimised sliding window methods trend toward superior performance. There is no evidence that discriminative models (linear regression and SVM regression) in general outperform generative models (LGS and STRM); the results here trend in the other direction.

| Spatially resolved stimulus activation maps

Each trial within the foraging task consisted of participants being presented with the amount of reward they had just accumulated (see Section 2.6.1 and Figure 9 for the full task outline). We first set out to decode the amount of reward presented on each trial and determine the origins of the activation patterns that predict it.
We fit STRM-Regression models, independently for each subject and session, to one-second epochs of EEG data following presentation of the amount of reward the participant had received. This allowed us to identify clear sensor space spatial topographies associated with each regressor in the STRM-Regression design matrix, with the first regressor representing a common mean pattern for each state over all trials and the second representing the value of the reward received. As shown in Figure 8, these maps identify patterns emerging initially over medial parietal regions in sensor space and propagating forward over successive states. The spatial location of the value signal is consistent with the literature on value encoding, which is associated with encoding initially in the superior parietal cortex and later in the medial prefrontal cortex (Hunt et al., 2012; Kolling, Behrens, Mars, & Rushworth, 2012). While the sensor space maps in Figure 8 do not afford the same precision of anatomical interpretation as the visual MEG source space maps of Figure 5 (note that we did not perform source reconstruction on the EEG data due to the moderate variability in cap and electrode positioning across subjects), they nonetheless suggest an anatomically derived interpretation: we can loosely interpret states 1-4 as representing stages of value processing concentrated on parietal areas, whereas states 5-8 represent the emergence of a value processing state more concentrated on medial frontal regions.

FIGURE 8 The stages of EEG value processing. Fitting the STRM-Regression model independently to each subject's EEG data with K = 8 states identified a consistent pattern of sequential activity on each trial, but with significant variation in the timing of events on individual trials. Top panel: mean state timecourse ± SE over subjects. Lower centre panel: example state timecourses for one subject, exhibiting significant variation over trials. Lower outer panels: mean (over subjects) activation patterns for each state and each regressor. Each trial is associated with consistent patterns of activity, comprising a mean pattern of activation common to that stage of cognitive processing on all trials and a separate value-specific component. Both components are characterised by medial parietal activation; in the case of the value signal this appears to emerge initially in parietal areas (e.g., state 2) and later move to more frontal regions (state 6). In response to concerns about eye movement artefacts accounting for significant decode-ability, none of these topographies appears eye movement related.
One common concern when using decoding in EEG is whether the prominent artefacts associated with eye movements confound the decoding accuracy. To make neuroscientific statements, experimentalists must be assured that the underlying source of their decoders' performance is neural rather than some muscle or ocular confound. The activation maps in Figure 8 speak directly to this concern: the clear midline activity patterns that these maps show are direct evidence against muscle confounds and eye motion artefacts driving the decoding result.

| Timing of value processing is modulated by cognitive variables
Effective completion of the foraging task requires participants to reconcile the amount of reward they receive on each trial with (a) the "accrual time," that is, the amount of time they had invested at the site in order to accrue that much reward, and (b) the recent reward rate, that is, what they might expect to gain from alternative sites. These two variables we will refer to as cognitive variables, as they are not explicitly presented to the participant on screen as a stimulus but must be estimated and tracked alongside the completion of the task.
We asked whether these cognitive variables influence the timing at which value is processed on individual trials. We investigated this by fitting a multiple regression model to predict the timings of the transitions between different states using the state timecourses inferred by STRM. This multiple regression had three regressors: accumulated reward, accrual time and recent reward rate, as shown in Figure 9. We found no significant relationship between value (the regressor being decoded) and state transition times at Bonferroni corrected levels. We did find a strong relationship between accrual time and state progression, with longer times invested at the site associated with more rapid transitions into states 2 and 3, the very earliest stages of stimulus processing in parietal cortical areas.

FIGURE 9 The timing of value processing is modulated by cognitive variables. We investigated whether latent cognitive variables within the overall task structure significantly affected the timing at which different stages of value processing emerged on different trials. Specifically, we asked whether three regressors (the overall reward accumulated, the accrual time, and the recent reward rate) could predict the transition times between states on individual trials. We found no significant effect of reward accumulated, the variable being decoded; however, we found significant effects of accrual time (reflecting how much time the participant invested to obtain the reward) and recent reward history (reflecting the expected value of alternative sites). Longer accrual times were associated with more rapid transitions through early stimulus processing states, whereas high rewards in recent history were associated with delayed transitions into intermediate stimulus processing states.
In a closely related line of work, Anderson and Fincham (2014a) have shown that such timing variability can be highly informative about how participants actually complete a task (Anderson et al., 2010, 2012; Anderson & Fincham, 2014b; Borst et al., 2016; Borst & Anderson, 2015). Their methods overlap with our own, with the key difference that they use an HMM with an unsupervised observation model, that is, a model in which the likelihood of the data is independent of any stimulus information (i.e., the design matrix X_t). A range of similar methods exist outside the HMM framework, all demonstrating that, in certain tasks, unsupervised patterns can be informative of the task structure itself (Sakoğlu et al., 2010; Sporns, Faskowitz, Teixeira, Cutts, & Betzel, 2021; Wu, Caprihan, & Calhoun, 2021). In contrast, our supervised model applies the GLM framework such that the likelihood of the data is conditioned upon this stimulus information (as specified in Equation (2)). This is a very significant difference in practice: unsupervised models may detect large-scale fluctuations in distributed activity, such as that between active perception and motor response phases of a task (Borst & Anderson, 2015), but will have limited sensitivity to more subtle differences, such as those evoked by different visual images.
Thus, we argue that our methods build upon these by generalising the study of inter-trial timing differences to a much broader range of experimental paradigms.
The approach of fitting an interpretable encoding model to training data and then performing a model inversion to make predictions about unseen data is well established in the fMRI literature (Casey et al., 2011; Friston et al., 2008; Kay et al., 2008; Mitchell et al., 2008; Naselaris et al., 2009; Nishimoto et al., 2011; Schoenmakers et al., 2013), but has so far seen only limited adoption in M/EEG (di Liberto et al., 2015; Kupers et al., 2020). Our focus in this article has been to emphasise the interpretability benefits of this approach, which are often overlooked, and to demonstrate how it can be readily extended to time series analysis for data at high temporal resolution in ways that we believe offer significant benefits over conventional timepoint-by-timepoint decoding. As shown above, however, the resulting gains in predictive accuracy were modest, suggesting that the subtle changes in activity pattern timing are not a major determinant of accuracy.

Another potential issue with our model is the assumption of mutual exclusivity between states, which produces sharp jumps in time between inferred activity patterns. If the underlying activity patterns are in fact smooth, then this discretisation introduces larger errors for timepoints adjacent to state boundaries. On the one hand, this approach is highly interpretable and supports relatively straightforward analyses of the impact of different behavioural variables on state transition times; on the other hand, it may come at a cost to predictive accuracy around state boundaries. We furthermore expect, given the challenge of overfitting with these datasets that motivated our left-right state constraint, that any comparable dynamic model with a continuous state space (e.g., Penny & Roberts, 1999) would be very difficult to suitably regularise.
The STRM model requires a hyperparameter specifying the number of states. As we have shown, this hyperparameter can be optimised through cross-validation, finding values that maximise predictive accuracy. Nonetheless, different parameter values for each subject make group comparisons difficult, such that in some cases this parameter must be assigned arbitrarily. Researchers concerned about such an arbitrary selection could base their decision upon the temporal generalisation profile of their data (King & Dehaene, 2014); as a rule of thumb, we might suggest matching the expected state duration (i.e., $T/K$) to the minimum width of the diagonal pattern observed in such temporal generalisation matrices.
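This rule of thumb is simple enough to express directly; the helper below is a hypothetical illustration, where the number of timepoints per trial and the diagonal width (in samples, read off a temporal generalisation matrix) are supplied by the researcher:

```python
def suggest_n_states(n_timepoints, diag_width):
    """Rule-of-thumb choice of the number of states K.

    Matches the expected state duration (n_timepoints / K) to the
    minimum width, in samples, of the diagonal band in the temporal
    generalisation matrix, i.e., K is approximately T / width.
    """
    return max(1, round(n_timepoints / diag_width))

# e.g. 200 samples per trial and a diagonal band roughly 25 samples wide
k = suggest_n_states(200, 25)  # suggests K = 8
```

A near-square generalisation matrix (a very wide diagonal) would accordingly suggest very few states, consistent with largely static activity patterns.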
Our work is closely related to several others; notably, a seminal paper showed how any decoding model can be inverted to recover an interpretable forward model of the original data (Haufe et al., 2014). We instead propose the opposite approach: to fit an encoding model of the data and then invert it to make predictions. How do these two approaches differ in practice? The first point we would make pertains to the classification accuracies we have reported; when using generative classifiers, these are generally either better than or not significantly different from those of a range of discriminative models tested, a finding that appears consistent with other examples in the literature (Grootswagers et al., 2017; Guggenmos et al., 2018). Note that all of these discriminative classifiers should, given enough datapoints, converge to the same boundary; generative classifiers, however, can learn faster if their modelling assumptions are accurate (Ng & Jordan, 2002). For a discriminative model, this ultimately equates to greater estimation error in the inverted model parameters; that is, the error in forward model parameters obtained via a model inversion is likely to be greater than the error in a directly fitted forward model. We demonstrate this using simulations (see section 4 in Supporting Information). While the improvement of a directly fitted forward model over a model inversion is modest, we can also show that it is not limited to scenarios where generative classifiers achieve greater classification accuracy than discriminative classifiers (see section 4 in Supporting Information). Furthermore, the interpretation of an inverted decoding model can be problematic wherever regularisation is used (Haufe et al., 2014); given the ubiquity of regularisation in decoding applications, this is not a niche problem.
Thus, where interpretation of forward model parameters is the goal, we would argue that one should just fit a forward model directly.
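The contrast between the two routes can be illustrated in a small simulation (an illustrative NumPy sketch, not the simulations reported in the Supporting Information): fit a backward decoding model, invert it with the covariance-based transformation of Haufe et al. (2014), and compare against a forward model fitted directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_channels = 500, 8

# Simulated data: one latent source s projected into the channels plus noise
s = rng.normal(size=n_trials)
a_true = rng.normal(size=n_channels)  # true forward activation pattern
X = np.outer(s, a_true) + rng.normal(size=(n_trials, n_channels))

# Backward (decoding) model: least-squares weights predicting s from X
w, *_ = np.linalg.lstsq(X, s, rcond=None)
s_hat = X @ w

# Inverted decoder (Haufe et al., 2014): A = cov(X) w / var(s_hat)
a_inverted = np.cov(X, rowvar=False) @ w / s_hat.var(ddof=1)

# Directly fitted forward (encoding) model: regress each channel on s
a_direct = (s @ X) / (s @ s)

# Both estimates recover the true pattern up to noise and scale
```

Both routes yield interpretable patterns here; the argument in the text is about their relative estimation error, which the Supporting Information examines more carefully.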
Finally, our work was originally motivated by that of Vidaurre et al. (2019), which introduced the HMM architecture in the context of decoding models. This demonstrated the potential of time series models for multivariate analysis, but left open the question of model interpretability. Similar to the question we posed in Figure 1, if each state is a different spatial filter, then how should one interpret observed differences in timing on individual trials? Given that spatial filters are not themselves directly interpretable, this question has no obvious answer. In contrast, by building our work around a forward encoding model based on the widely used GLM framework, each state has a clear correspondence to a set of stimulus activation patterns. This affords a straightforward interpretation of each state as a successive stage of stimulus processing, allowing us to explore its spatial and temporal properties and provide a richer picture of the overall patterns of activation.

| CONCLUSION
Neuroscientists want to understand what information is represented in brain activity, as well as how and when it is expressed. Conventional methods limit what researchers can investigate in several ways. By obscuring the spatial patterns from which decoding accuracy metrics are derived, researchers are often left to interpret a result without clear knowledge of its spatial origins. By assuming the same process occurs at the same millisecond on every trial, researchers are unable to investigate meaningful patterns of temporal variation over trials and are limited in the experimental designs they can pursue.
The STRM model addresses these points with two main innovations: firstly, through the use of an interpretable encoding model to reveal how activity patterns are spatially distributed, and secondly, through the use of an HMM to reveal how activity patterns are temporally distributed. M/EEG recordings offer the potential to reveal how the brain represents information at millisecond resolution; the methods we have developed seek to leverage this resolution to answer simultaneously the questions of when and where these patterns emerge. By characterising both the spatial and temporal structure of neural representations, we may obtain a more holistic understanding of brain function, offer a new perspective on the role of timing in cognitive processes, and support more flexible experimental designs in the future.

where $L$ is a $[Q \times 1]$ vector of unnormalised class log-likelihoods with each entry $L_i = \sum_{k=1}^{K} -\frac{\tilde{\gamma}_{t,k}}{2} \left(Y_t - I_i \tilde{W}_k\right) \tilde{\Sigma}_k^{-1} \left(Y_t - I_i \tilde{W}_k\right)^T$. It therefore follows that $P(x_t = i \mid Y_t, \tilde{\theta}, \tilde{Z}) = \sigma_i(L)$.
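A minimal sketch of this computation (with illustrative shapes and variable names, not the paper's implementation) makes the inversion concrete:

```python
import numpy as np

def class_posteriors(Y_t, I, W, Sigma_inv, gamma_t):
    """P(x_t = i | Y_t) via a softmax over the class log-likelihoods L_i.

    L_i = sum_k -(gamma_{t,k}/2) (Y_t - I_i W_k) Sigma_k^{-1} (Y_t - I_i W_k)^T

    Shapes (illustrative): Y_t (p,) sensor data; I (Q, q) one design row
    per class; W (K, q, p) per-state GLM weights; Sigma_inv (K, p, p)
    per-state inverse covariances; gamma_t (K,) state probabilities at t.
    """
    Q, K = I.shape[0], W.shape[0]
    L = np.zeros(Q)
    for i in range(Q):
        for k in range(K):
            r = Y_t - I[i] @ W[k]            # residual under class i, state k
            L[i] -= 0.5 * gamma_t[k] * r @ Sigma_inv[k] @ r
    L -= L.max()                             # numerical stability before exp
    p = np.exp(L)
    return p / p.sum()                       # the softmax sigma(L)

# Toy check: data generated exactly under class 0 of a one-state model
W = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])  # K=1, q=2, p=3
I = np.eye(2)                                       # Q=2 one-hot classes
post = class_posteriors(I[0] @ W[0], I, W,
                        np.array([np.eye(3)]), np.array([1.0]))
```

Subtracting the maximum log-likelihood before exponentiating leaves the softmax unchanged but avoids overflow when the quadratic forms are large.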

Or, in vector notation: $P(X_t \mid Y_t, \tilde{\theta}, \tilde{Z}) = X_t\, \sigma(L)$, where $\sigma(L)$ is the softmax function, which outputs a $[Q \times 1]$ vector with each entry $\sigma_i(L) = \frac{e^{L_i}}{\sum_{j=1}^{Q} e^{L_j}}$.

ACKNOWLEDGMENTS

APPENDIX B
B.1 | STRM-Regression Bayesian model inversion
We wish to infer the posterior distribution given by: $P(X_t \mid Y_t, \tilde{\theta}, \tilde{Z}) = P(Y, \tilde{\theta}, \tilde{Z} \mid X_t)$