Computationally efficient methods for fitting mixed models to electronic health records data

Motivated by two case studies using primary care records from the Clinical Practice Research Datalink, we describe statistical methods that facilitate the analysis of tall data, with very large numbers of observations. Our focus is on investigating the association between patient characteristics and an outcome of interest, while allowing for variation among general practices. We explore ways to fit mixed effects models to tall data, including predictors of interest and confounding factors as covariates, and including random intercepts to allow for heterogeneity in outcome among practices. We introduce: (1) weighted regression and (2) meta-analysis of estimated regression coefficients from each practice. Both methods reduce the size of the dataset, thus decreasing the time required for statistical analysis. We compare the methods to an existing subsampling approach. All methods give similar point estimates, and weighted regression and meta-analysis give similar standard errors for point estimates to analysis of the entire dataset, but the subsampling method gives larger standard errors. Where all data are discrete, weighted regression is equivalent to fitting the mixed model to the entire dataset. In the presence of a continuous covariate, meta-analysis is useful. Both methods are easy to implement in standard statistical software.


Introduction
Routinely collected datasets including electronic health records and other administrative datasets are becoming increasing widely used in healthcare research. These data sources can offer a number of advantages over traditional designed data sources such as randomised trials and surveys, for example speed of access, richness of data recording, extended longitudinal measurements and low cost.
Traditionally designed studies are often limited in size and scope; they may exclude patients with comorbidities and the elderly, and patients may decline to participate. Using routine data means that research can be done on a much wider population and therefore with much greater precision, often using hundreds of thousands of patient records to answer questions about what is happening in the "real world".
In the UK, development of electronic health records databases has been aided by the state-funded National Health Service (NHS); virtually all UK residents are registered with a primary care based general practitioner (family doctor) who provides care to all ages and acts as a gatekeeper to specialist secondary care services. Importantly, there is near universal adoption of clinical computer systems by general practices. The Clinical Practice Research Datalink (CPRD) is a government supported initiative to provide researchers with access to NHS data in a secure and ethical way. The CPRD database represents one of the largest longitudinal primary care datasets in the world 1 , collating routinely collected anonymised electronic health records data from consenting general practices on a monthly basis. As of March 2015, CPRD included over 11.3 million patients from 674 general practices, representing approximately 7% of the UK population. 1 Electronic health records such as those provided by CPRD are important in shaping public health policy. They are used to identify patients at risk of diseases; monitor the safety of medicines and vaccinations; and understand the effectiveness of treatments in different groups of patients. For this reason it is essential that studies using CPRD data are carried out to a high standard. In practice, however, routine datasets are subject to a number of challenges because they have not been collected with a specific research question in mind. These challenges include heterogeneity among outcomes, missing data and confounding; if left unaddressed, these could result in bias and incorrect inferences.
Methods such as generalized linear mixed models, multiple imputation 2 and propensity score adjustment 3 can be used to address these limitations. However, computational constraints can limit their applicability to datasets comprising very large numbers of observations. Datasets extracted for the purposes of research are usually of a size that can be stored in a standard desktop computer's memory, but it can take a number of hours to fit standard statistical models to data containing a very large number of observations, also known as tall data. This is less of an issue when one wants to fit a single statistical model, but in practice researchers using routinely collected data often fit several models, for example to compare different statistical models, investigate effect modification and conduct sensitivity analyses. As datasets are increasing in size and statistical models are becoming more complex, there is a need for computationally efficient methods for analysing tall health data.
Recently, new statistical and/or computational methodologies have been proposed that scale problems to a reasonable size 4 . These include the "divide and conquer" approach, where the data are divided into subsamples, the subsamples are analysed in parallel and the results are then combined across subsamples. 5,6 Other methods include the use of experimental design techniques to extract a representative subsample 7 and the "bags of little bootstrap" approach 8 , where results from bootstrapping a number of subsamples are averaged to give the effect estimate of interest. These methods would be difficult for applied researchers with only a basic understanding of statistics and programming to understand and implement.
Our focus is on fitting statistical regression models to routine clinical data, where the computing time required to fit such models is long due to a large number of observations and heterogeneity in outcome among very many healthcare institutions (such as general practices). We explore the use of weighted regression and meta-analysis techniques; these statistical methods are likely to be familiar to applied researchers in healthcare research, though not for the purpose of analysing tall routine datasets. We compare these methods to a recently proposed subsampling approach. 7 The methods are applied to electronic health records data from two CPRD studies, where the interest lies in investigating the association between some patient characteristics and a health-related outcome. We describe the advantages and disadvantages of subsampling, weighted regression and meta-analysis approaches, and identify the settings in which these methods would be most useful. R 9 computing code for all methods is available in Supporting Information.
The rest of the paper is set out as follows. In section 2 we introduce two case studies using electronic health records data from CPRD for research. In section 3, methods for tall data analysis are described, including an existing subsampling approach, weighted regression and meta-analysis approaches. The use of these methods is illustrated in section 4 through application to the case study examples described in section 2. We conclude with a discussion in section 5.

Case study 1
Electronic health records data from CPRD were used to investigate the impact of the UK primary care payment-for-performance system (Quality and Outcomes Framework, QOF) on the detection and treatment of cardiovascular risk factors in patients with severe mental illness. 10 This retrospective open cohort study included 427,190 patients of at least 35 years of age (67,239 with severe mental illness (SMI); 359,951 without SMI). In the original study, six different binary outcomes were considered, but in this paper we focus on recording of elevated cholesterol, defined as a binary indicator of first recording of serum cholesterol ≥5mmol/L. The dataset used for statistical analysis comprises 2,116,948 observations recorded by 674 general practices in the UK between 1996 and 2014. There are two interventions of interest: the first was a QOF incentive in 2004 for annual review of physical health in patients with SMI, and the second was a revised incentive in 2011 specific to cardiovascular review.
Our analyses were carried out as pre-specified in the original study protocol before data access. We used an interrupted time series analysis, as illustrated in Figure 1. The targets of inference are the changes in intercept and slope in years 2004 and 2011. We chose to use a logistic regression model, for its preferable interpretation as modelling log odds. In the logistic regression model we analysed the binary outcomes y, allowing separate intercepts for the cases with SMI and controls without SMI.
The parameters of interest are: β 8 , the change in intercept at year 2004 for the SMI vs. non-SMI group; β 9 , the change in intercept at year 2011 for the SMI vs. non-SMI group; β 10 , the change in slope after year 2004 for the SMI vs. non-SMI group; β 11 , the change in slope after year 2011 for the SMI vs. non-SMI group. Also of interest is the heterogeneity among general practices in the preintervention intercept β 0 , since it was hypothesised that baseline data for practices would show important variation. To allow for heterogeneity in outcome among 674 general practices, we included random intercepts b=(b 1 ,…,b 674 ) in the model. Adjustment was made for age (as a continuous variable) and gender as confounding factors in the model.

Case study 2
A separate subset of CPRD data was used to investigate the association between multi-morbidity (the co-occurrence of two or more health conditions) and health service utilisation in a cohort of 403,985 adult patients from 404 general practices in England followed for a period of four years, starting in 2012. 11 Outcomes of interest were rates of primary care consultations, prescriptions of medications and hospital admissions. Here we analyse rates of consultations among 349,785 adult patients from 353 general patients. We removed data from 51 practices with zero recordings of consultations because we believe that consultation data are missing for these practices.
We used a negative binomial regression model to model the number of primary care consultations, and included three covariates: age (as a continuous variable), gender and the number of morbidities, categorised into groups of low (0-1), moderate (2-3), high (4-5), very high (6+). An offset variable was used to define the exposure period. It has been shown that there are large variations in the recording of data among practices 1 . Variation between practices was therefore of interest and accounted for by the inclusion of a random intercept for each of 353 general practices in the model.

Methods
The problem common to our two case studies was the length of time required to fit non-linear mixed effects regression models to data comprising hundreds of thousands of observations nested within hundreds of general practices: 11 hours in case study 1 and 1.7 hours in case study 2, using R (version 3.3.1) 9 on a computer with a 64-bit Windows operating system, Intel Core i7 processor, 3.4GHz speed and 16 GB installed memory (RAM).
In this paper we reproduce analyses that have been carried out in two applied research studies using electronic health records data from the CPRD. Both studies used a generalised linear mixed model 12 in the statistical analysis. Here we describe the general form of this model.
Observations on the ith of N units consist of an outcome y i and vectors x i and z i of explanatory variables associated with the fixed and random effects. Units in health records data are typically clustered, for example patients may be grouped within general practices located in regions, and there can be multiple measurements on the same patient. We suppose that, given a q-dimensional vector b of random effects, the outcomes y i are conditionally independent. The conditional mean of y i is related to the linear predictor through a link function g: where y is the vector of outcomes (y 1 ,…,y n ) T , X and Z are the design matrices with rows x i T and z i T , and β is a p-dimensional vector of fixed effects.
We assume that b has a multivariate normal distribution with mean 0 and covariance matrix Σ.
Generalised linear mixed effects models have become relatively straightforward to fit with the use of readily available computing code for standard statistical regression software, such as the glmer function for R 9 in library lme4 13 , and the melogit and menbreg commands for Stata 14 . However, standard regression software has not been developed for the purpose of analysing tall data.

Subsampling approach
Instead of analysing all the data, Drovandi et al. 7 proposed a subsampling algorithm based on using experimental design techniques to extract a subsample that is representative of the entire dataset.
This approach involves an initial learning phase, in which we learn about our model parameters by fitting the mixed effects regression model to a sample of the entire dataset. One difficulty in using design of experiments to construct a subsample is the need to specify an initial subsample. The choice of initial sample influences how many iterations of the algorithm are required in order to reach a subsample that is sufficient to precisely estimate the parameters of the regression model. For our approach, we took a random sample of 10,000 observations as our initial sample in a similar way to Drovandi et al. Alternatively, the initial sample could be empirically based, for example by ensuring that the difference in the mean outcome in the initial and full sample is small. In this paper, the size of the initial subsample is chosen such that the proposed statistical model can be fitted in reasonable time, and the sample is large enough to allow model parameters to be reasonably well estimated.
Each combination of covariates x to be included in the model is referred to as a "design" in the subsampling algorithm described by Drovandi et al. 7  A difficulty in using experimental design techniques is the need to evaluate a utility function, which is typically difficult to compute and may require numerical approximation. When calculating each expected Fisher information matrix, we treat the random effects b as nuisance parameters and fix these to zero in our implementation of the subsampling approach. This removes the integrals with respect to b in the expected Fisher information and simplifies the computation. We derive the expected Fisher information for the logistic regression and negative binomial models in Supporting Information S1.
We want to find the optimal covariate combination x* that gives the maximum value of U(x). For our purpose of parameter estimation, the expected Fisher information matrix is a sensible utility function because maximising the utility function corresponds to minimising the variance of the estimated regression coefficients β and increasing the amount of information. Since it is easier to work with a scalar than a matrix, we took the determinant of the expected Fisher information in the utility function as an estimate of overall precision, which accounts for correlations among pairs of regression coefficients.
The optimal covariate combination x* is chosen as the covariate combination x with the maximum utility U(x). We add these data to our subsample and repeat the process as follows: 1. Fit the mixed effects regression model to the current subsample 2. Evaluate the utility function U(x) at each unique covariate combination x remaining in the original dataset.

Find the covariate combination/s x*= x that maximises U(x).
4. Extract observations with covariate combination/s x* remaining in the original dataset.
5. Add these data to the subsample and return to step 1.
This process continues until we reach a desired subsample size or until we collect sufficient information required to answer the research question, e.g. when the standard errors of parameter estimates are reasonably small and there is little change in the estimated parameters with increasing iterations of the algorithm. In the applications to the case study examples, we chose to stop the process at reaching a sample size that was "fair" to this method, e.g. to allow a reasonable number of iterations of the algorithm, but choice of sample size may be based on practical considerations.
In their subsampling algorithm, Drovandi et al. 7

consider all possible combinations of covariate values
x that are available for inclusion in the subsample, irrespective of whether they were present in the full dataset or not. This helps to identify where information is lacking in the dataset. If the optimal covariate combination is not present in the dataset, the combination present in the dataset that is closest in Euclidean distance to the optimal covariate combination is chosen. In contrast, our approach is to only consider selecting covariate combinations that are present in the entire dataset, since there are likely to be many possible covariate combinations in big datasets and it would be impractical to evaluate the utility function for each of these covariate combinations.
In some datasets the number of covariate combinations present may be small relative to the size of the dataset, and therefore the number of observations per covariate combination is large. This means that at each stage of the subsampling procedure, we add a large number of observations to our current subsample and consequently only a small proportion of covariate combinations are sampled before the desired subsample size is reached. In this situation, we suggest modifying the above subsampling approach to extract a random sample of values with the chosen covariate combination x rather than all observations with the chosen covariate combination x at each stage of the subsampling procedure.
Choice of the proportion of observations to extract from the entire dataset will depend on the number of covariate combinations in the dataset; if this is very small relative to the size of the dataset, then we might want to limit the proportion of observations that we extract with the chosen covariate combination, in order to allow time for other covariate combinations to be potentially selected for inclusion in the subsample.

Weighted regression
The idea of weighted regression is to collapse the dataset such that it contains only unique observations, i.e. where no two observations are equivalent in all outcome and covariate values, and a variable indicating the number of times each observation occurs in the full sample.
The first step in this method is to create a new variable to weight our data based on the frequency of observations with the same values of outcome, covariates and any nesting variable (in our case studies general practice). Then we collapse the dataset to a smaller dataset of values of outcome, covariates, nesting variables, and weight. In other words, we remove replicated observations with the same values of outcome, covariates and nesting variables.
Where the outcome of interest is a rate as in case study 2, we also need to weight and collapse the data by the offset variable that is used to denote the exposure period. In the presence of a continuously measured covariate, we categorize the covariate values into groups, for example by inspecting the entire dataset to find sensible cut-off values or quartiles, and replace the values of that covariate by the mean value for observations with the same recorded outcome, covariate combination (including category of continuous covariate) and values of nesting variables. A larger number of groups would better reflect the variation in the continuous covariate values, but the number of groups should be small enough for the original dataset to be collapsed to a sufficiently smaller size for analysis. By assigning each observation the mean value of the continuous covariate for its group defined by covariate combination, outcome and nesting variables, the measurement error introduced by categorizing continuous covariates is a Berkson error (uncorrelated with the covariate values used in the weighted regression analysis) and hence only a very minor source of bias. 15 In the statistical analysis we use a weighted generalised linear mixed effects model, This method is easy to implement, by specifying the argument weight in the function glmer 13 for R. 9 In a dataset comprising only discrete outcome and covariate data, fitting the weighted model to the collapsed dataset is equivalent to fitting the standard model to the entire untransformed dataset.

Meta-analysis (Divide and recombine)
In applications where the data are too large to analyse all at once, it has been proposed to use divide and recombine approaches, in which we divide the data into a number of smaller samples, analyse these subsamples individually, and then combine the results of these individual analyses. 4,5 Our aim is to estimate the regression coefficients β in a multi-level model, in order to determine the association between p independent variables and an outcome of interest. We divide the data into top level units according to a variable z and fit the regression model (equation (1)) to the data from each subgroup (conditional variable division). Fitting the regression model to data from each subgroup provides estimatesˆk j β of the regression coefficients kj β (k=0,…,p) for each subgroup j of z.
We combine the regression coefficient estimates across subgroups using two meta-analysis techniques: univariate meta-analysis and multivariate meta-analysis.
In separate univariate meta-analyses of estimated regression coefficients, we assume that the estimated regression coefficients ˆk j β from each subgroup j have a normal random-effects distribution: where kj β is the true regression coefficient for each subgroup, k β is the combined regression coefficient, 2 kj s is the estimated within-subgroup variance for the regression coefficient (assumed known) and k τ 2 is the between-subgroup variance. We estimate k τ 2 for each regression coefficient k that is assumed to vary across subgroups in equation (1), and set k τ 2 to be zero for fixed regression coefficients.
Regression coefficients estimated from the same model are correlated and by meta-analysing each regression coefficient separately this correlation is ignored. This could lead to overestimated variances of the combined regression coefficients k β and biased estimates. Multivariate meta-analysis provides a solution to this problem, by summarizing all regression coefficient estimates simultaneously.
For each subgroup j, we have p+1 estimated regression coefficients denoted as j β =( 0j β ,…, β p ) ' . In a multivariate meta-analysis, we assume these estimates to have a multivariate normal (MVN) distribution: In our applications to CPRD data, we divided the data into subgroups by general practice, analysed the data from each practice separately and combined the results from each practice using metaanalysis techniques. The regression models originally fitted to each dataset assumed a random practice-level intercept and fixed covariate effects. For this reason, we estimated between-practice variance in the univariate meta-analysis of intercept estimates only and fixed the between-practice variance to zero for all other univariate meta-analyses of regression coefficients, k=1,…,p. Similarly, in the multivariate meta-analysis we only estimated the first entry in Σ corresponding to the betweenpractice variance in intercept and set all remaining entries of to zero.
We implemented univariate and multivariate meta-analyses using the metafor package 17 and mvmeta package 18 or R, respectively. A number of methods are available for estimating the meta-analysis models, including restricted maximum likelihood estimation (REML) and method-of-moments approaches. These methods primarily differ in the estimation of the between-subgroup covariance matrix Σ. In this paper we present results using REML, which is commonly used for mixed effects models because it yields unbiased estimates of variance and covariance parameters. The difficulty with likelihood-based methods is that they become computationally intensive and time consuming as the number of subgroups and regression coefficients increases. Where the number of regression coefficient estimates per subgroup is large such that computing time for estimation of the multivariate meta-analysis model is very slow, one may prefer to use method-of-moments estimation, which requires no numerical maximization or iteration and is consequently fast to implement.

Applications to case study examples
In this section we apply the above methods to the two case study examples (Section 2) to demonstrate the use of each method for fitting mixed regression models to tall routine datasets.

Case study 1
Results from fitting the logistic regression model to the entire dataset are displayed graphically in Figure 1 and we report results for the parameters of interest numerically in Table 1 Figure 2 shows the 95% confidence intervals for all regression coefficient estimates of interest at each iteration of the subsampling algorithm. As expected, the confidence intervals tend to narrow with each additional iteration, as more data are included in the analysis. In total we selected 93 covariate combinations x for inclusion in our subsample; Table 3 (Table 1).
Meta-analysis is the fastest approach to implement, taking just 68 seconds. In this example, correlation between the covariate time and the intercept, and between the indicator of intervention 1 and the intercept, causes the univariate meta-analysis of intercept estimates to yield a higher betweenpractice variance τ 2 estimate of 2.68, compared with the standard approach of analysing the entire dataset (τ 2 =0.05). The multivariate meta-analysis approach incorporates the correlation between main effects and hence yields the same estimate of 0.05 for τ 2 as the conventional analysis of the entire dataset.
The estimated risk differences obtained through conventional analysis of the entire dataset are almost identical to those derived using weighted regression, and results based on subsampling and metaanalysis techniques are similar ( Table 1). The standard errors under the subsampling approaches are reasonably small compared with those derived from analyses of the entire dataset. CI: 2.10 to 2.14).

Case study 2
Also displayed in Table 4 are results from the different methods described in section 3. At each iteration of the subsampling algorithm, we assume that only the observed combinations of the covariate levels (formed by inspecting the entire dataset) shown in Table 2  Using meta-analysis to combine the estimated regression coefficients across practices leads to very close estimates for the regression parameters of interest, compared to the conventional approach of analysing the entire dataset.
Overall, point estimates for rate ratios derived from the conventional analysis of the entire dataset are close to those derived through alternative approaches using weighted regression and meta-analysis techniques (Table 4). Results from the subsampling approach lead to very similar conclusions of associations between covariates and the outcome of interest, although there are some discrepancies in parameter estimates. The standard errors under the subsampling approaches are reasonably small compared with those obtained through analysing the entire dataset. The directions of the effect estimates are the same across methods, for all rate ratios.

Discussion
Fitting mixed effects regression models has become relatively straightforward with the use of readily available computing code for standard statistical software such as R or Stata. However, it is time consuming and impractical to fit such models to tall data comprising hundreds of thousands of observations nested within hundreds of general practices. We have described how weighted regression and meta-analysis can be used for the purpose of analysing tall routine datasets. We have compared these methods to an existing subsampling approach 7 , through application to electronic practice was very small relative to the size of our example datasets and so it took over 24 hours to proceed through the subsampling algorithm.
Another limitation of the subsampling method is that we consider all combinations of covariate values observed in the entire dataset for inclusion in our subsample. This would not be practical in other datasets with very many variables, since there are likely to be many possible covariate combinations that are available for selection from the dataset. In datasets where some values of covariates are not very frequent, the approach for extracting covariate combinations could be improved by extracting covariate combinations that occur within a specified window. 7,19 This will be most relevant for datasets with continuously measured variables.
Where it is preferable to use all of the data, we have found weighted regression and meta-analysis techniques to perform well. Both methods involve fitting the regression models to smaller datasets, thus reducing the time required for statistical analysis. In case study 1, the weighted regression approach reduced the size of the dataset substantially to 13% of its original size. Thus, this method facilitates analysis in software that currently cannot handle tall data, for example enabling Bayesian analyses in WinBUGS. 20 Although weighted regression is useful, there would be little gain in using this approach where the outcome of interest is very variable. Before attempting this approach we recommend checking that the number of possible outcome values is small relative to the size of the entire dataset, so that it is possible to collapse the dataset to a much smaller size. For similar reasons, weighted regression is limited in the presence of a continuously measured covariate. Our approach was to categorise any continuous covariate into fairly broad groups before collapsing the dataset to a smaller size, and to use the mean covariate value for each group in the regression analysis. This approach performed well in the applications to data from both case studies 1 and 2, but it does introduce measurement error and may yield larger standard errors for regression coefficients in other datasets. A further limitation of weighted regression is that it scales badly, becoming increasingly slow as the number of covariates increases.
In the presence of a continuous covariate, we have found meta-analysis techniques to be useful and we expect that this would also be the case where the outcome is continuously measured. For our purpose of fitting a regression model with a practice-level effect, meta-analysis approaches were fast to implement because the division of the data by practice removed the need for the practice-level effect in the analysis of data from each subgroup. We note that computing time could be reduced further by combining the weighted regression and meta-analysis approaches.
An important feature of the meta-analysis approach is that one may not be estimating the same quantity as a complete data analysis, especially for non-linear link functions. In particular, a single model would typically be specified so that the effects of the confounding variables are taken to be the same across all subgroups. In contrast, the meta-analysis approach allows confounder effects to vary across subgroups and, as such, provides a more flexible approach to the control of confounding. This is a reason for some differences between estimates obtained through conventional analysis of the entire dataset and meta-analysis of practice-level results. In future work we plan to consider methods to estimate a common treatment effect while allowing confounder effects to vary across subgroups.
Our case studies were carried out as pre-specified in the original study protocols. We focussed on fitting two-level models to data comprising observations (level-one units) nested in general practices (level-two units). With more than two levels, the meta-analysis approaches could be used to analyse the data. One approach could be to divide the data into level-two units (e.g. practices), and recombine in multiple stages (one stage per level), or to divide into top level units (e.g. regions). In our datasets, it was natural to divide the data by practice, but there may not always be an obvious variable on which to partition the data. The use of experimental designs could assist in this. 7,19 A limitation of the meta-analysis approaches is that we could not handle cross-classified data which may arise in studies of hospital patient outcomes, if there is interest in allowing both for betweenhospital variation and between-practice variation. Another disadvantage of the meta-analysis approaches is that it would be impossible to fit a regression model to data from practices that show no variation in the outcome of interest. In this case we could divide the data at a higher level, for example by region, then fit a mixed model to data from each practice and combine random practicelevel effects and fixed covariate effects across regions using meta-analysis techniques.
Univariate meta-analysis would be expected to perform reasonably well when the covariates are centred and uncorrelated. In case study 1, the between-practice variance in intercept was higher in the univariate meta-analysis than in the conventional analysis of the entire data. For the purpose of examining association between some patient characteristics and an outcome of interest, estimation of the intercept and random-effects distribution is not important. Further investigation is needed to determine the use of meta-analysis for other purposes, such as prediction modelling.
In the analysis of longitudinal data it may be required to fit models with more than one random effect, for example at the patient and practice levels. The meta-analysis approach could be used for this purpose. Before attempting this analysis of routine data, it is important to think about the longitudinal value of the data which has not been collected for the purpose of research. 21 Developing methods for the analysis of tall longitudinal data should form the subject of future work.
The primary aim in each case study is to estimate the effect of independent variables on an outcome of interest in the population. For the purpose of population-based inference, marginal models fitted by generalized estimating equations would have been less sensitive to parametric assumptions than likelihood-based methods and would have been computationally more efficient. 22 The computationally efficient methods described in this paper would facilitate fitting marginal models in the same way they have assisted in fitting generalised linear mixed models to tall data. We chose not to focus on marginal models here. In the protocols for the case studies considered in this paper it was of interest to allow for further nesting of observations within regions as well as practices, and this is something that is less straightforward to account for in marginal models. Further, marginal models are less useful for prediction, which is of interest in many studies using health records data, since no distribution is specified for the outcome of interest.
Findings based on analysis of routine data which have not been collected for the purpose of research are likely to be affected by numerous biases. Routine data may provide incomplete information which would need explicit strategies such as multiple imputation and data linkage to deal with. There is a further danger of selection bias, for example due to disease and event definitions relying on code groups and algorithms that are typically less accurate than prospectively applied definitions.
Adjusting for multiple biases in the analysis of routine data would greatly increase the complexity of the mixed model and computing time. The methods described in this paper could potentially assist in this; a representative subsample could be used for methodological development and meta-analysis might facilitate the application of bias-adjustment methods to the entire dataset.
In summary, we have identified scalable methods for the analysis of routine datasets comprising a very large number of observations. The existing subsampling approach allows us to extract the data required to answer the research question and has performed well in example applications. However, it may be preferable to make use of all the data. Where all data are discrete, weighted regression is equivalent to fitting the model to the entire dataset. In the presence of a continuously measured covariate, we have found meta-analysis techniques to be very useful. Both weighted regression and meta-analysis approaches are accessible to applied researchers and are easily implemented in standard statistical software.  -1.5,-1,-0.5,0,0.5,1,1.5,2,2.5 0,1 0,1 0,1 0,1 Table 3 The first ten covariate combinations selected at each iteration of the existing subsampling approach.

Covariate
Iteration of algorithm  RR rate ratio; SE standard error. *We used the mean age for each group defined by values of outcome, covariate values, practice and the offset variable, time in the study. †Each row corresponds to a separate univariate meta-analysis of regression coefficient estimates. ‡Dispersion parameter estimates weighted according to the size of the practice and averaged across practices. §Average of three measurements of the time taken to implement the procedure using R (version 3.3.1) 9 on a computer with a 64-bit Windows operating system, Intel Core i7 processor, 3.4GHz speed and 16 GB installed memory (RAM).