Optimal sampling in retrospective logistic regression via two-stage method



Case–control sampling is popular in epidemiological research because it saves cost and time. In a logistic regression model, little is known a priori about the covariance matrix of the point estimator of the regression coefficients, so no fixed-sample-size analysis is available. In this study, we propose a two-stage sequential analysis in which the optimal sample fraction and the sample size required to achieve a predetermined volume of a joint confidence set are estimated in an interim analysis. The additional observations required are then collected in the second stage according to the estimated optimal sample fraction. At the end of the experiment, data from the two stages are combined and analyzed for statistical inference. Simulation studies are conducted to justify the proposed two-stage procedure, and an example is presented for illustration. The proposed two-stage procedure is found to perform adequately in the sense that the resulting joint confidence set has a well-controlled volume and achieves the required coverage probability. Furthermore, the optimal sample fractions in all the selected scenarios are close to one. Hence, the proposed procedure can be simplified by always using a balanced design.
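
To make the interim sample-size step concrete, the following is a minimal Python sketch and not the procedure of this study. It assumes that the joint confidence set is the usual Wald-type ellipsoid, that the estimated covariance matrix shrinks in proportion to 1/n when the sample fraction is held fixed, and that the chi-square critical value is supplied by the caller. The function names `ellipsoid_volume` and `required_total_size` are hypothetical, and the estimation of the optimal sample fraction is not covered.

```python
import math
import numpy as np


def ellipsoid_volume(cov, crit):
    """Volume of {b : (b - b_hat)' cov^{-1} (b - b_hat) <= crit}."""
    p = cov.shape[0]
    # Volume of the unit ball in p dimensions.
    unit_ball = math.pi ** (p / 2) / math.gamma(p / 2 + 1)
    return unit_ball * crit ** (p / 2) * math.sqrt(np.linalg.det(cov))


def required_total_size(cov_stage1, n_stage1, crit, target_volume):
    """Total sample size needed to reach target_volume.

    Assumption (not from the source): cov scales like 1/n at a fixed
    sample fraction, so the volume scales like n ** (-p / 2).
    """
    p = cov_stage1.shape[0]
    v1 = ellipsoid_volume(cov_stage1, crit)
    n_total = n_stage1 * (v1 / target_volume) ** (2 / p)
    # Never return fewer observations than are already in hand.
    return max(n_stage1, math.ceil(n_total))


# Illustrative first-stage values (made up for this sketch).
cov1 = np.diag([0.04, 0.01])      # estimated covariance of two coefficients
crit = -2.0 * math.log(0.05)      # 95% chi-square quantile for 2 d.f.
v1 = ellipsoid_volume(cov1, crit)
n_needed = required_total_size(cov1, 200, crit, target_volume=v1 / 2)
```

Under these assumptions, the second stage would collect `n_needed - 200` further observations, split between cases and controls according to whatever sample fraction the interim analysis suggests.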