## 1 Introduction

The creation and validation of a new patient-reported outcome (PRO) questionnaire typically requires a team of experts, among them behavioral scientists who analyze the questionnaire data. However, the analytic tasks may become the responsibility of the biostatistician if a behavioral scientist is not available or if the behavioral scientist has limited training in the more sophisticated item response theory (IRT) methods [1-5]. Recent publications on health-related quality of life (HRQOL) assessments indicate the increasing use of IRT as the primary statistical method in developing and evaluating PRO instruments [6-8]. IRT models have also been used to address novel research questions in medicine and bioinformatics (e.g., [9] and [10]). Biostatisticians can reach new research territories by adding IRT to their data-analytic repertoire. This tutorial aims to provide a step-by-step guide to carrying out basic IRT analyses using the freely available software programs R [11] and WinBUGS [12].

In December 2009, the US Food and Drug Administration (FDA) announced the final guidance on using PROs as part of clinical trials and for drug labeling [13], 3 years after the announcement of a draft guidance [14]. In 2005, the European Medicines Agency published a European guidance document on the evaluation of medical products in cancer by PROs [15]. The FDA guidelines are the regulatory agency's attempt to standardize the procedures for the creation, refinement, validation, and clinical use of psychometric instruments in clinical trials and drug labeling. These guidelines aim to provide practical advice to researchers, including how to define the PRO domain(s) to be measured (e.g., Section III.C on conceptualizing the PRO constructs of interest), how to write survey items, how to decide what response format is most appropriate, and how to evaluate patient understanding. As part of ‘item analysis’, survey items may be deleted or modified in response to patient understanding and preliminary data analysis. This tutorial focuses primarily on item analysis from a Bayesian IRT perspective. Item analysis is typically iterative: several revisions may be required before the draft survey instrument is finally deemed valid (as per Section III.E), reliable, and responsive to changes in well-being, to name just a few of the FDA's recommendations. However, the FDA guidelines offer no explicit recommendations on how to carry out these analyses, although the conceptual diagram hints at a factor analysis framework [13, Figure 4]. Most of our biostatistician colleagues are familiar with factor analysis but not with IRT. This motivated us to write this article on item analysis using IRT.

Item response theory modeling is also useful in its own right. For example, IRT has been applied in analyzing inter-rater agreement data in rating the severity of hip fractures [9], and in microarray gene expression analysis to identify clusters of genes that are related to drug response in acute leukemias [10]. Extensions of the classical Rasch model (RM) [16] have also been applied in identifying clusters of students with discrete levels of latent academic achievement. More generally, the RM is closely related to the conditional logit model [17] and the conditional logistic regression model for binary matched pairs [18, Chapter 10]. Despite their versatility, IRT models have yet to gain wider use in biostatistics. This is in part because the command syntax of popular IRT software programs can be arcane for new users (for a list of packages, see [2]). The occasional user of IRT may be hesitant to invest the time and effort needed to learn it. We hope that this tutorial facilitates the use of IRT models. It distills the cited sources into a practical guide using freely available statistical programming languages so that readers can immediately apply these analytic skills in their own research.

Our primary goal is to guide readers in applying their Bayesian analytic skills to a previously unfamiliar area of statistics. (Thus, we provide details on how an IRT model is derived.) We also hope that this article is equally useful to statisticians who are quite familiar with IRT and/or psychometrics but are new to a Bayesian analytic approach to IRT modeling. (Thus, we provide details on Bayesian computation.) The overall plan is to provide enough mathematics on both IRT model derivations and Bayesian computing so that these methods can be quickly deployed in practice. Muraki [19] provided a worked example on the generalized partial credit model (GPCM). The GPCM is among several commonly used models for analyzing items with polytomous response categories [6]. This article is not about choosing one model from among alternative IRT models. Interested readers can find discussions of model choice elsewhere [20, 21]. The deviance information criterion of Spiegelhalter *et al.* [22], which is calculated as part of the default output from R2WinBUGS, is useful in model selection. What we lack in breadth, we hope to compensate for in depth.

We organize this paper as follows. Section 2 covers the theories behind widely used IRT models, including the RM [16] for binary responses and the partial credit model (PCM) [23] and the GPCM [19] for polytomous item responses. In Section 3, we develop the GPCM from a Bayesian perspective. We do not go into the details of how to manually carry out Gibbs sampling in IRT but provide a list of references for interested readers. Section 4 translates the GPCM mathematics into WinBUGS syntax. We assume that readers have already installed R, the R2WinBUGS package in R, and WinBUGS on a computer platform of their choice. In Section 5, we illustrate how to diagnose the convergence of iterative sampling. Sections 3–5 are the main focus of this paper. They cover, in detail, how to fit the GPCM using R and WinBUGS. Section 6 focuses on the practical aspects of item analysis, in particular how to decide which questionnaire items should be modified or deleted. Our overall pedagogical plan is to provide enough mathematical rigor on IRT so that readers can acquire a working knowledge of IRT without the need to review the vast psychometric literature spanning several decades. Finally, we discuss in Section 7 how these steps can be used to address the statistical considerations outlined in the FDA guidelines.