An application of Bayesian measurement invariance to modelling cognition over time in the English Longitudinal Study of Ageing

Abstract
Objectives: Recommended cut-off criteria for testing measurement invariance (MI) using the comparative fit index (CFI) vary between −0.002 and −0.01. We compared CFI results with those obtained using Bayesian approximate MI for cognitive function.
Methods: We used cognitive function data from Waves 1–5 of the English Longitudinal Study of Ageing (ELSA; Wave 1 n = 11,951), a nationally representative sample of English adults aged ≥50. We tested for longitudinal invariance using CFI and approximate MI (prior for a difference between intercepts/loadings ~N(0, 0.01)) in an attention factor (orientation to date, day, week, and month) and a memory factor (immediate and delayed recall, verbal fluency, and a prospective memory task).
Results: Conventional CFI criteria found strong invariance for the attention factor (CFI + 0.002) but either weak or strong invariance for the memory factor (CFI −0.004). The approximate MI results also supported strong MI for attention but found 9/20 intercepts or thresholds were noninvariant for the memory factor. This supports weak rather than strong invariance.
Conclusions: Within ELSA, the attention factor is suitable for longitudinal analysis but the memory factor is not. More generally, in situations where the appropriate CFI criteria for invariance are unclear, Bayesian approximate MI could be used as an alternative.

(such as attention and hearing). The demands on other functions will differ between tests. Each additional function utilised in performing each individual task may be differentially affected by ageing, disease, or setting (McAvinue et al., 2012; Wiegand et al., 2014). As well as different rates of change secondary to cognitive or physical processes, the size of practice effects may also vary between different tests of the same cognitive function (Calamia, Markon, & Tranel, 2012). Any of these may change the strength of the association between the individual cognitive tests and the latent cognitive function over time. In factor analysis, this manifests as a change in factor loading or intercept and is known as measurement noninvariance, that is, a lack of MI (van de Schoot, Lugtig, & Hox, 2012).
MI has been discussed extensively elsewhere and has been identified as a problem in longitudinal studies of cognitive function since at least the late 1980s and early 1990s (Horn & McArdle, 1992; Schaie, Willis, Jay, & Chipuer, 1989). With some notable exceptions, population and clinical research on cognitive function has tended to overlook this issue, preferring summed scores whose measurement properties are often not examined (Blankson & McArdle, 2013; McArdle, Fisher, & Kadlec, 2007; Wicherts, 2016). If this issue is ignored, estimates of change in cognitive function over time are biased: in the direction of the change when an intercept shifts, and with effects that vary across the score range when a factor loading changes (Ferrer, Balluerka, & Widaman, 2008; Horn & McArdle, 1992; van de Schoot et al., 2013; Wicherts, 2016; Widaman, Ferrer, & Conger, 2011). For example, practice effects would be expected to increase the intercept, leading to an overestimation of cognitive ability at follow-up visits and thus an underestimation of decline over time (Wicherts & Dolan, 2010). Alternatively, a decrease in factor loading, caused by increasing sensory impairment weakening the association between measured and latent cognitive function, could lead to overestimation of cognitive function for low scorers and underestimation for high scorers as time progresses (Wicherts, 2016).

| Conventional MI
Underlying a set of k (n = 0, …, k) continuous observed variables c that have been measured, there is a latent variable η (Muthén & Asparouhov, 2013; van de Schoot et al., 2013). If they are measured in individual i at time t, the measurement part is

$$c_{ikt} = v_{kt} + \lambda_{kt}\,\eta_{it} + \varepsilon_{ikt} \qquad (1)$$

Here, $c_{ikt}$ is the observed value of variable k at time t in individual i, $v_{kt}$ is the intercept for variable k at time t, $\lambda_{kt}$ is the loading for variable k at time t, $\eta_{it}$ is the value of the latent variable at time t for individual i, and $\varepsilon_{ikt}$ is the error for individual i at time t for observed variable k. This model assumes that the c's are independent conditional on the factor, that the residuals are uncorrelated with the factors, and that the errors are normally distributed with a mean of 0. The factor metric is usually set by fixing λ = 1 for one observed variable.
A linear growth curve for factor scores (the structural model) is

$$\eta_{it} = \eta_{0i} + \eta_{1i}\,t + \zeta_{it} \qquad (2)$$

Here, $\eta_{0i}$ is the intercept of the latent variable, $\eta_{1i}$ is the slope growth factor, and $\zeta_{it}$ is the time- and individual-specific residual. The binary case is a straightforward extension of Equation (1), and if a probit link function is assumed, then the latent variable is assumed to follow a continuous distribution and the structural model is unchanged. Otherwise, it should be noted that the intercept $v_{kt}$ is replaced with the threshold $-\tau_{tl}$ (Muthén, 2004).
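As a rough illustration of how the measurement and structural parts combine, the following sketch simulates a single indicator driven by a linearly declining latent variable, once with an intercept that is equal at every wave and once with an intercept that rises after the first wave. All numerical values (sample size, slope, residual SDs, and the 0.15 intercept shift standing in for a practice effect) are hypothetical and chosen only for illustration; they are not ELSA estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

n_people, n_waves = 5000, 5
time = np.arange(n_waves)                       # t = 0, 1, ..., 4

# Structural model: eta_it = eta_0i + eta_1i * t + zeta_it
eta0 = rng.normal(0.0, 1.0, n_people)           # latent intercepts
eta1 = rng.normal(-0.10, 0.05, n_people)        # latent slopes (average decline)
zeta = rng.normal(0.0, 0.2, (n_people, n_waves))
eta = eta0[:, None] + eta1[:, None] * time + zeta

# Measurement model for one indicator: c_it = v_t + lambda * eta_it + eps_it
loading = 1.0
eps = rng.normal(0.0, 0.5, (n_people, n_waves))
v_invariant = np.zeros(n_waves)                        # same intercept at every wave
v_practice = np.array([0.0, 0.15, 0.15, 0.15, 0.15])   # hypothetical practice effect

c_invariant = v_invariant + loading * eta + eps
c_practice = v_practice + loading * eta + eps


def mean_change(scores):
    """Mean at the last wave minus mean at the first wave."""
    return scores[:, -1].mean() - scores[:, 0].mean()


print("latent change:            ", round(mean_change(eta), 3))
print("observed, invariant:      ", round(mean_change(c_invariant), 3))
print("observed, shifted intercept:", round(mean_change(c_practice), 3))
```

Under these assumptions the indicator with the shifted intercept shows a smaller decline than the latent variable actually undergoes, which is the kind of bias that ignoring intercept noninvariance is expected to produce.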
For continuous variables, the specification of MI consists of (a) the same variables load onto the same factors at each time point (the same vector of $c_{ikt}$ for each $\eta_{it}$), (b) the factor loadings are equal at each time point, (c) the intercepts are equal at each time point, and (d) the residual variances are equal at each time point (van de Schoot et al., 2012; Widaman et al., 2011). If only a holds, this is known as configural invariance, a-b weak invariance, a-c strong invariance, and a-d strict invariance. In the case of binary observed variables, the second stage (weak factorial invariance) is skipped because the item probability curve is influenced simultaneously by loading and intercept (Muthén & Muthén, 2014).
Strong invariance needs to be established in order to compare latent means over time (Ferrer et al., 2008; Widaman et al., 2011) (Blankson & McArdle, 2013; Muthén & Asparouhov, 2013). Therefore, with large sample sizes, alternative fit indices, in particular the comparative fit index (CFI), are frequently used instead (Cheung & Rensvold, 2002; Meade & Bauer, 2007) (Muthén & Asparouhov, 2013; van de Schoot et al., 2013; Verhagen & Fox, 2013). The basic effect of approximate MI is that, instead of being required to be exactly equal, the loadings are "tethered": they are allowed to differ, but only by a substantively unimportant amount.
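As described above, the conventional condition which must be met for strong factorial invariance (and therefore the ability to measure change in latent means over time) is that for each of the observed variables the loadings and intercepts are equal at every time point. Taking three time points as an example, let ђ be the difference between λ's such that $\lambda_{k1} - \lambda_{k2} = ђ_{k12}$, $\lambda_{k2} - \lambda_{k3} = ђ_{k23}$, and $\lambda_{k1} - \lambda_{k3} = ђ_{k13}$. Also, let и be the difference between v's such that $v_{k1} - v_{k2} = и_{k12}$, $v_{k2} - v_{k3} = и_{k23}$, and $v_{k1} - v_{k3} = и_{k13}$. The frequentist assumption of strong invariance can then be defined in Bayesian terms as the strongly informative priors $ђ_{kXX} \sim N(0, 0)$ and $и_{kXX} \sim N(0, 0)$ (Muthén & Asparouhov, 2013).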
Given that, from a Bayesian perspective, the factor loadings and intercepts are random variables, the assumption of 0 variance is difficult to envisage in this framework. With approximate MI, this is instead relaxed slightly to a still strong but more plausible informative prior with 0 mean and small variance, such as $ђ_{kXX} \sim N(0, 0.01)$ and $и_{kXX} \sim N(0, 0.01)$. One reason for preferring the Bayesian approach in this situation is that the assumption of exact equality is relatively unrealistic in a number of situations due to issues such as random variation across many time points, attrition, or practice effects (Blankson & McArdle, 2013; Putnick & Bornstein, 2016). The researcher can decide a priori how long to make the tether by specifying an appropriate prior for the difference between loadings or intercepts over time.
The size of the prior variance therefore sets the length of the tether and formalises the degree of invariance which is allowable.
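To make the tether length concrete, the sketch below converts a prior variance into the central interval that the corresponding zero-mean normal prior places on a difference between parameters. The second argument of N(0, 0.01) is read as a variance, as in the text; the function name and the two comparison variances are illustrative assumptions.

```python
import math
from statistics import NormalDist


def tether_interval(prior_variance, coverage=0.95):
    """Central interval of a N(0, prior_variance) prior on a parameter difference."""
    sd = math.sqrt(prior_variance)
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return -z * sd, z * sd


for variance in (0.001, 0.01, 0.05):
    low, high = tether_interval(variance)
    print(f"variance={variance}: 95% of prior mass within [{low:.3f}, {high:.3f}]")
```

A variance of 0.01 corresponds to a standard deviation of 0.1, so under this prior about 95% of differences are expected to lie within roughly ±0.2 on the scale of the parameter.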
The difference at each time point is tested to see whether it is statistically significantly different from the mean of the loadings at all time points. This shows whether any of the loadings have broken the tether and display a degree of noninvariance beyond that believed to be unimportant by the researcher. Additionally, this overcomes the problems in identifying the truly noninvariant parameters caused by fixing one indicator's loadings at 1 for all time points. Using the Bayesian approximate MI approach, one need only fix a single loading for a single observed indicator at a single time point to 1 (Muthén & Asparouhov, 2013; Xu & Green, 2015).
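One way to picture this test is as a check on posterior draws: for each draw, subtract the across-wave mean of the parameter from its value at each wave, then ask whether the 95% credible interval of that deviation excludes zero. The sketch below does this with NumPy on simulated draws. It is not the Mplus implementation, and the function name, the draw counts, and the loading values are hypothetical.

```python
import numpy as np


def flag_noninvariant(draws, level=0.95):
    """draws: array of shape (n_draws, n_waves) for one parameter.

    Returns the posterior mean deviation of each wave from the across-wave
    mean, and whether the credible interval of that deviation excludes zero.
    """
    deviations = draws - draws.mean(axis=1, keepdims=True)
    tail = (1.0 - level) / 2.0
    lower = np.quantile(deviations, tail, axis=0)
    upper = np.quantile(deviations, 1.0 - tail, axis=0)
    excludes_zero = (lower > 0) | (upper < 0)
    return deviations.mean(axis=0), excludes_zero


rng = np.random.default_rng(1)
# Hypothetical posterior draws for one loading at five waves; Wave 1 is lower.
centres = np.array([0.30, 0.37, 0.37, 0.37, 0.37])
draws = rng.normal(centres, 0.02, size=(4000, 5))

mean_dev, flagged = flag_noninvariant(draws)
for wave, (dev, flag) in enumerate(zip(mean_dev, flagged), start=1):
    print(f"Wave {wave}: deviation {dev:+.3f}, noninvariant: {bool(flag)}")
```

In this simulated example only the first wave is flagged, because its deviation from the across-wave mean is large relative to the posterior uncertainty.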
An alternative frequentist approach to testing for MI is running models with and without MI to see if the results are conflicting (Widaman et al., 2011). With this approach, an often informal decision is made about the degree of conflict in the results that is acceptable before MI is rejected. This decision is made using substantive prior subject knowledge and implicitly includes an assumption about the acceptable degree of noninvariance. The Bayesian approach formalises the same substantive knowledge into the prior, which can therefore be specifically tested.
When assessing for longitudinal invariance in the English Longitudinal Study of Ageing (ELSA), we encountered several of the aforementioned problems with conventional MI testing. The sample size is large; therefore, the χ2 test is likely to be overly conservative (Chen, 2007; Cheung & Rensvold, 2002; Steptoe, Breeze, Banks, & Nazroo, 2013).
Response rates were 70% at Wave 1, 82% at Wave 2, 73% at Wave 3, 74% at Wave 4, and 80% at Wave 5 (Steptoe et al., 2013). After the exclusion of extreme values (see below), final sample sizes were n = 11,951 at Wave 1, n = 9,313 at Wave 2, n = 7,850 at Wave 3, n = 6,911 at Wave 4, and n = 6,535 at Wave 5.

| Cognitive measures
The cognitive tests were performed by computer-assisted interview.
Orientation to time was assessed by asking the participant to name the day, year, month, and date. To assess immediate and delayed verbal recall, 10 common words were played to participants (Steel, Huppert, McWilliams, & Melzer, 2004). Immediate recall was assessed straight away, and delayed recall of the word list was tested after the other cognitive tests had been undertaken (these also served as a distraction technique). The word lists used were randomly assigned, and a standardised recording was used for all participants.
The prospective memory task required participants to remember to write their initials in the top corner of a page they were handed.
Participants were prompted if they did not complete the actions spontaneously. A binary variable was used for remembering the correct action (either prompted or spontaneous). Semantic (category) fluency was assessed by asking participants to name as many animals as they could in 1 min. All the nonbinary variables were transformed to z-scores for the purpose of inclusion in the factor structure.
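For readers unfamiliar with the transformation, a z-score subtracts the mean and divides by the standard deviation. The sketch below is a minimal NumPy version that skips missing values. The text does not state whether standardisation was carried out within each wave or across pooled waves, so the function simply standardises whatever array it is given; the function name and the example counts are hypothetical.

```python
import numpy as np


def to_z_scores(values):
    """Standardise to mean 0 and SD 1, leaving missing values (NaN) missing."""
    values = np.asarray(values, dtype=float)
    return (values - np.nanmean(values)) / np.nanstd(values, ddof=1)


# Hypothetical animal-naming counts, with one missing response.
animals_named = np.array([12, 18, 25, np.nan, 20, 15])
z = to_z_scores(animals_named)
print(np.round(z, 2))
```

Note that standardising separately within each wave would remove mean change over time from the indicator, which is one reason the choice of reference distribution matters in longitudinal work.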

| Statistical analysis
Other missing data were considered missing at random, which is a property of the Bayesian estimation (Chen & Ibrahim, 2014). Research on how missingness affects longitudinal invariance has only been conducted in a single study using full information maximum likelihood and, while a topic warranting further investigation, is beyond the scope of this analysis (Sterba, 2017).
Initial exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) assuming invariance were performed as part of an earlier study currently in submission. Two of the factors from this, attention (loaded onto by orientation questions) and memory (loaded onto by immediate and delayed recall, prospective memory, and verbal fluency), were used. The model was specified using CFA with configural invariance and modification indices checked to see if there was any need to make additional modifications beyond the basic factor structure (Muthén & Muthén, 2014). This identified that allowing residual covariance over time in verbal fluency and within factor covariances for immediate and delayed recall resulted in substantially improved model fit. This improved model was then tested using the χ 2 test and CFI for MI.
Next, the Bayesian approximate MI model was specified. A prior of N(0, 0.01) was used for all differences between the loadings, intercepts, and thresholds at each wave and their mean across all waves. The MPlus default noninformative priors were used for all other model parameters (Muthén & Asparouhov, 2011). The conclusions about the level of MI in the data were then compared between the frequentist χ2 test and CFI and the Bayesian approximate MI.
The primary analysis was run for all ages in the ELSA data. Sensitivity analyses were run using age bands to check for one possible source of longitudinal noninvariance. Though there was slightly less noninvariance for older participants, and slightly more for younger participants, the overall pattern of results was very similar for all ages.
Due to this, they are not presented here.
The data were edited using Stata version 12, and the structural equation modelling was performed using MPlus version 7.0 (Muthén & Muthén, 2014; StataCorp, 2011). Markov Chain Monte Carlo estimation was utilised with the MPlus default Gibbs sampler and convergence criterion, 105,000 iterations (of which the first 55,250 are burn-in), and no thinning (Muthén & Asparouhov, 2011).

| RESULTS
The participants at Wave 1 were 55.7% female, had a mean age of 64.2, and 2.8% of the sample were of non-White ethnicity (Table 1). A large minority of participants were retired (47.7%), and the majority of the rest of the sample were either employed (28.1%) or self-employed (5.7%). Most participants were married (56.2% first marriage; 11.1% remarried). The modal educational attainment was no qualifications (41.7%), with 11.5% having attained a degree. There was a bimodal distribution of social class, with the largest group being Class 5 (manual and routine occupations; 35.0%) and the second largest Class 1 (managerial and professional roles; 29.7%).
The approximate MI results found that there was one parameter in the attention factor that showed a minor degree of noninvariance (Table 4): the Wave 1 loading for recall of the day (0.326), which is 0.036 less than the mean loading across all waves (0.362). This was a statistically significant difference based on the 95% credible interval, but it is not likely to have a substantively important impact on the results of longitudinal analysis.
For the memory factor, there is only one noninvariant loading: the Wave 4 verbal fluency loading (0.927), which is 0.029 greater than the mean across all waves (0.898). However, 9 of the 20 intercepts and thresholds are noninvariant. For immediate recall, the 2nd (0.052 above the mean), 3rd (0.032 above the mean), and 5th (0.057 below the mean) intercepts show significant noninvariance. For delayed recall, the 2nd (0.036 above the mean) and 3rd (0.053 above the mean) occasions are noninvariant. In verbal fluency, the 2nd measurement occasion is estimated as being 0.009 above the mean. For the prospective memory task, the threshold on the 1st occasion is 0.069 above the

| DISCUSSION
When analysing cognitive function data from ELSA, we encountered a situation where different recommendations for using the CFI to establish MI led to different conclusions. We sought to use approximate MI to provide an alternative method of deciding which level of MI to accept or reject. In this case, the approximate MI approach identified small but significant noninvariance in the loadings of the memory and attention factors that was not identified by the use of CFI (which did not reject weak invariance). However, the degree of noninvariance in loadings that was identified using approximate MI but missed by CFI was relatively trivial. This suggests that the assumptions of strong longitudinal MI in the attention factor and weak invariance in the memory factor are plausible.
The main source of longitudinal noninvariance was not in the factor loadings but in the intercepts of the memory factor. This led to strong invariance being rejected by both the stricter CFI criteria and approximate MI. This is particularly important because strong invariance is required to compare latent means over time and is therefore necessary for longitudinal analysis. However, using alternative CFI cut-off rules for MI would have led the authors to a different conclusion about the presence or absence of strong invariance for the memory factor. Using a cut-off of −0.01, such as that recommended by Chen (2007) or Cheung and Rensvold (2002), strong invariance would not have been rejected. Moreover, as discussed by Short (2014), the truly suitable cut-off for CFI may be different again when using the specific number of time points and observed variables available. Using approximate MI revealed that there was a high proportion of noninvariant intercepts and thresholds for the memory factor, caused by multiple small deviations from invariance. This would have been difficult to identify accurately in a step-wise fashion using a frequentist estimator.
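The dependence of the conclusion on the chosen cut-off can be shown with a few lines of arithmetic. The sketch below takes the two CFI values quoted in the abstract (+0.002 for attention and −0.004 for memory), reads them as changes in CFI when strong invariance is imposed, and compares them with the two ends of the recommended cut-off range. That reading, and the function and variable names, are assumptions made for illustration.

```python
def rejects_invariance(delta_cfi, cutoff):
    """True when CFI worsens by more than the cut-off allows.

    Both arguments are signed: a negative value means a drop in CFI.
    """
    return delta_cfi < cutoff


cfi_changes = {"attention": 0.002, "memory": -0.004}
cutoffs = (-0.002, -0.01)

decisions = {
    (factor, cutoff): rejects_invariance(delta, cutoff)
    for factor, delta in cfi_changes.items()
    for cutoff in cutoffs
}

for (factor, cutoff), rejected in decisions.items():
    verdict = "rejected" if rejected else "not rejected"
    print(f"{factor}: cut-off {cutoff} -> strong invariance {verdict}")
```

Under these assumptions the attention factor passes with either cut-off, whereas the memory factor passes with −0.01 but fails with −0.002.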
If using factor analysis or another data reduction method, including sum scores, then ignoring this noninvariance would have resulted in bias in the estimation of the memory factor latent mean (Muthén & Asparouhov, 2013; van de Schoot et al., 2013). In our results, the Waves 2 and 3 memory factor latent means would have been overestimated due to increases in the immediate and delayed recall intercepts. Wave 5 would have been underestimated because of decreases in the immediate recall intercept and prospective memory threshold. These effects would result in bias in the estimation of both the rate and shape of the latent growth curve.
The noninvariance in the memory factor seems to be a combination of several isolated deviations and a linked increase in immediate and delayed recall in Waves 2 and 3. It is possible that the noninvariance seen at Waves 2 and 3 for the intercepts of immediate and delayed recall represents unequal practice effects in the indicators of this factor. The reduction in Waves 4 and 5 may represent fatiguing practice effects, an initial practice effect followed by more rapid decline in performance on those tasks or practice effects for the other indicators catching up relative to the recall tasks (Calamia et al., 2012).
Whether Bayesian MI could be used to detect non-uniform practice effects may be an avenue for further research.
The present study has the strength of using data from a high-
Bayesian structural equation modelling (BSEM) retains the common practical problems of many types of Bayesian analysis in terms of computational intensity, challenges with assessing convergence, and unfamiliarity to many users. This is particularly the case in comparison with approaches to identifying MI such as straightforwardly comparing parameters between models that assume or do not assume MI. Although this approach may provide rapid answers in some clear-cut situations, in many cases the results are borderline even if an acceptable difference between estimates is prespecified (e.g., 5% or 10%; Flora & Curran, 2004). This approach will also be model specific if the target of interest is a predictor of growth or a distal outcome, and the additional information about invariant parameters will not be obtained, unlike with approximate MI.
Approximate MI, although not a panacea, is designed to handle multiple small noninvariances, and its power to detect noninvariance is not known to be affected by changing the number of groups or occasions being compared, which provides substantial flexibility. As such, it may be useful for future researchers to consider when testing the measurement properties of their instruments in longitudinal research.
With regard to ELSA specifically, we find an attention factor that essentially shows strong MI over time but only weak invariance for a memory factor. Although the degree of noninvariance was relatively small, it affected a large number of parameters; researchers may therefore wish either to avoid using the memory factor for longitudinal research or to accommodate the noninvariance using approximate or partial MI.