Data from complex surveys are being used increasingly to build the same sort of explanatory and predictive models as those used in the rest of statistics. Unfortunately the assumptions underlying standard statistical methods are not even approximately valid for most survey data. The problem of parameter estimation has been largely solved, at least for routine data analysis, through the use of weighted estimating equations, and software for most standard analytical procedures is now available in the major statistical packages. One notable omission from standard software is an analogue of the likelihood ratio test. An exception is the Rao–Scott test for loglinear models in contingency tables. In this paper we show how the Rao–Scott test can be extended to handle arbitrary regression models. We illustrate the process of fitting a model to survey data with an example from NHANES.

Aalen's nonparametric additive model, in which the regression coefficients are assumed to be unspecified functions of time, is a flexible alternative to Cox's proportional hazards model when the proportionality assumption is in doubt. In this paper, we incorporate a general linear hypothesis into the estimation of the time-varying regression coefficients. We combine unrestricted least squares estimators and estimators that are restricted by the linear hypothesis and produce James-Stein-type shrinkage estimators of the regression coefficients. We develop the asymptotic joint distribution of such restricted and unrestricted estimators and use this to study the relative performance of the proposed estimators via their integrated asymptotic distributional risks. We conduct Monte Carlo simulations to examine the relative performance of the estimators in terms of their integrated mean square errors. We also compare the performance of the proposed estimators with a recently devised LASSO estimator as well as with ridge-type estimators, both via simulations and using data on the survival of primary biliary cirrhosis patients.
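As a rough illustration of how restricted and unrestricted estimators are typically combined, a generic Stein-type shrinkage estimator can be written as below. The notation is an assumption made here for illustration and is not taken from the paper: the superscripts U and R denote the unrestricted and restricted estimators, T_n is a test statistic for the linear hypothesis, and c > 0 is a shrinkage constant.

```latex
% Generic Stein-type shrinkage form (illustrative notation, not the paper's)
\hat{\beta}^{S}(t)
  = \hat{\beta}^{R}(t)
  + \left(1 - \frac{c}{T_n}\right)
    \left(\hat{\beta}^{U}(t) - \hat{\beta}^{R}(t)\right),
\qquad
% Positive-part variant: never shrinks past the restricted estimator
\hat{\beta}^{S+}(t)
  = \hat{\beta}^{R}(t)
  + \max\!\left(0,\; 1 - \frac{c}{T_n}\right)
    \left(\hat{\beta}^{U}(t) - \hat{\beta}^{R}(t)\right).
```

When T_n is large, so that the data contradict the hypothesis, the estimator stays close to the unrestricted one; when T_n is small, it moves toward the restricted one.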

Probabilistic matching of records is widely used to create linked data sets for use in health science, epidemiological, economic, demographic and sociological research. Clearly, this type of matching can lead to linkage errors, which in turn can lead to bias and increased variability when standard statistical estimation techniques are used with the linked data. In this paper we develop unbiased regression parameter estimates to be used when fitting a linear model with nested errors to probabilistically linked data. Since estimation of variance components is typically an important objective when fitting such a model, we also develop appropriate modifications to standard methods of variance components estimation in order to account for linkage error. In particular, we focus on three widely used methods of variance components estimation: analysis of variance, maximum likelihood and restricted maximum likelihood. Simulation results show that our estimators perform reasonably well when compared to standard estimation methods that ignore linkage errors.

Parametric confidence intervals are given for linear combinations of the means of independent Poisson variables and for their continuous versions. The performance of the intervals is assessed using simulation. A real data set is used to compare the proposed intervals with known ones. The proposed intervals are shown to be superior to known ones and comparable to exact intervals.
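For orientation, the simplest of the known intervals for a linear combination of independent Poisson means is the Wald interval, sketched below. This is a baseline only and is not the interval proposed in the paper; the function name `wald_interval` and its arguments are hypothetical. It assumes one observed count per Poisson variable.

```python
import math
from statistics import NormalDist


def wald_interval(counts, coeffs, level=0.95):
    """Wald interval for sum(a_i * lambda_i) from one Poisson count per mean.

    Baseline sketch only: uses the normal approximation with the plug-in
    variance sum(a_i**2 * x_i), since Var(X_i) = lambda_i is estimated by x_i.
    """
    if len(counts) != len(coeffs):
        raise ValueError("counts and coeffs must have the same length")
    # Two-sided normal quantile for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    estimate = sum(a * x for a, x in zip(coeffs, counts))
    variance = sum(a * a * x for a, x in zip(coeffs, counts))
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


# Example: interval for the difference of two Poisson means.
lower, upper = wald_interval([10, 20], [1, -1])
```

The Wald interval is known to behave poorly for small counts, which is the kind of setting where improved intervals are of interest.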

In this paper, we introduce linear modeling of canonical correlation analysis, which estimates canonical direction matrices by minimising a quadratic objective function. The linear modeling results in a class of estimators of canonical direction matrices, and an optimal class is derived in the sense described herein. The optimal class offers several desirable advantages: first, its estimates of canonical direction matrices are asymptotically efficient; second, its test statistic for determining the number of canonical covariates always has a chi-squared distribution asymptotically; third, it is straightforward to construct tests for variable selection. The standard canonical correlation analysis and other existing methods turn out to be suboptimal members of the class. Finally, we study the role of canonical variates as a means of dimension reduction for predictors and responses in multivariate regression. Numerical studies and data analysis are presented.

Variational Bayes (VB) estimation is a fast alternative to Markov Chain Monte Carlo for performing approximate Bayesian inference. This procedure can be an efficient and effective means of analyzing large datasets. However, VB estimation is often criticised, typically on empirical grounds, for being unable to produce valid statistical inferences. In this article we refute this criticism for one of the simplest models where Bayesian inference is not analytically tractable, that is, the Bayesian linear model (for a particular choice of priors). We prove that under mild regularity conditions, VB-based estimators enjoy some desirable frequentist properties such as consistency and can be used to obtain asymptotically valid standard errors. In addition to these results we introduce two VB information criteria: the variational Akaike information criterion and the variational Bayesian information criterion. We show that the variational Akaike information criterion is asymptotically equivalent to the frequentist Akaike information criterion and that the variational Bayesian information criterion is first order equivalent to the Bayesian information criterion in linear regression. These results motivate the potential use of the variational information criteria for more complex models. We support our theoretical results with numerical examples.
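To make the setting concrete, the following is a minimal sketch of mean-field VB for a Bayesian linear model. The priors are an assumption made here (a zero-mean normal prior on the coefficients with variance `sigma2_beta`, and an inverse-gamma prior with parameters `A` and `B` on the error variance); the abstract does not state which priors the authors use, and the function name `vb_linear_model` is hypothetical.

```python
import numpy as np


def vb_linear_model(X, y, sigma2_beta=1e4, A=0.01, B=0.01, n_iter=100, tol=1e-10):
    """Mean-field VB for y = X beta + e, e ~ N(0, sigma^2 I).

    Assumed priors (illustrative): beta ~ N(0, sigma2_beta * I),
    sigma^2 ~ Inverse-Gamma(A, B).
    Returns the parameters of q(beta) = N(mu, Sigma) and
    q(sigma^2) = Inverse-Gamma(A_q, B_q).
    """
    n, p = X.shape
    XtX = X.T @ X
    Xty = X.T @ y
    A_q = A + 0.5 * n                  # shape of q(sigma^2); fixed across iterations
    B_q = B + 0.5 * float(y @ y)       # crude starting value for the rate
    mu = np.zeros(p)
    Sigma = np.eye(p)
    for _ in range(n_iter):
        e_inv_sigma2 = A_q / B_q       # E_q[1 / sigma^2]
        # Update q(beta) given the current q(sigma^2).
        Sigma = np.linalg.inv(e_inv_sigma2 * XtX + np.eye(p) / sigma2_beta)
        mu = e_inv_sigma2 * (Sigma @ Xty)
        # Update q(sigma^2) given the current q(beta).
        resid = y - X @ mu
        B_q_new = B + 0.5 * (float(resid @ resid) + float(np.trace(XtX @ Sigma)))
        converged = abs(B_q_new - B_q) < tol * abs(B_q)
        B_q = B_q_new
        if converged:
            break
    return mu, Sigma, A_q, B_q


# Example on simulated data (intercept 1, slope 2, unit error variance).
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(500), rng.normal(size=500)])
y_demo = X_demo @ np.array([1.0, 2.0]) + rng.normal(size=500)
mu_demo, Sigma_demo, A_demo, B_demo = vb_linear_model(X_demo, y_demo)
```

Under these assumed priors the variational posterior mean `mu_demo` plays the role of the VB point estimator, and `Sigma_demo` is the variational posterior covariance of the coefficients.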

Early generation variety trials are very important in plant and tree breeding programs. Typically many entries are tested, often with very few resources available. Unreplicated trials using control plots are popular and it is common to repeat the trials at a number of locations. An alternative is to use partially replicated (p–rep) designs, where a proportion of the test entries are replicated at each location. We extend a method for the generation of p–rep designs based on *α*–arrays to allow for a much broader class of designs to be constructed. Updating procedures for the average efficiency factor and its upper bound are developed for application to the computer generation of efficient p–rep designs.