The creation and validation of a new PRO questionnaire typically requires a team of experts, among them behavioral scientists who analyze the questionnaire data. However, the analytic tasks may become the responsibility of the biostatistician if a behavioral scientist is not available or if the behavioral scientist has limited training in the more sophisticated item response theory (IRT) methods [1-5]. Recent publications on health-related quality of life (HRQOL) assessments indicate the increasing use of IRT as the primary statistical method in developing and evaluating PRO instruments [6-8]. IRT models have also been used to address novel research questions in medicine and bioinformatics (e.g.,  and ). Biostatisticians can easily reach new research territories by incorporating IRT to their data-analytic repertoire. This tutorial aims to provide a step-by-step guide in carrying out basic IRT analyses using freely available software programs r  and WinBUGS .
In December 2009, the US Food and Drug Administration (FDA) announced the final guidance on using PROs as part of clinical trials and for drug labeling , 3 years after the announcement of a draft guidance . In 2005, the European Medicines Agency published an European guidance document in the evaluation of medical products in cancer by PROs . The FDA guidelines are the regulatory agency's attempt to standardize the procedures in the creation, refinement, validation, and clinical use of psychometric instruments in clinical trials and drug labeling. These guidelines aim to provide practical advice to researchers, including how to define the PRO domain(s) to be measured (e.g., Section III.C on conceptualizing the PRO constructs of interest), how to write survey items, how to decide what response format is most appropriate, and how to evaluate patient understanding. As part of ‘item analysis’, survey items may be deleted or modified in response to patient understanding and preliminary data analysis. This tutorial focuses primarily on item analysis from a Bayesian IRT perspective. Item analysis is typically iterative—several revisions may be required before the draft survey instrument is finally deemed valid (as per Section III.E), reliable, and responsive to changes in well-being, just to name a few of FDA's recommendations. However, the FDA guidelines offer no explicit recommendations on how to carry out these analyses, although the conceptual diagram hints at a factor analysis framework  ,Figure 4. Most of our biostatistician colleagues are familiar with factor analysis but not with IRT. This motivated our writing of this article to cover item analysis using IRT.
Item response theory modeling is also useful in its own right. For example, IRT has been applied in analyzing inter-rater agreement data in rating the severity of hip fractures , and in microarray gene expression analysis to identify clusters of genes that are related to drug response in acute leukemias , Extensions of the classical Rasch model (RM)  have also been applied in identifying clusters of students with discrete levels of latent academic achievements. More generally, the RM is closely related to the conditional logit model  and the conditional logistic regression model for binary matched pairs  ,Chapter 10. Despite their versatility, IRT models have yet to gain wider use in biostatistics. This is in part because the command syntax of popular IRT software programs can be arcane for new users (for a list of packages, see ). The occasional user of IRT may be hesitant in investing the time and effort in learning it. We hope to facilitate the use of IRT models in this tutorial—a distillation of the cited sources into a practical guide using freely available statistical programming languages so that the readers can immediately apply these analytic skills in their own research.
Our primary goal is to guide the readers in applying their Bayesian analytic skills to a previously unfamiliar area of statistics. (Thus, we provide details on how an IRT model is derived.) We also hope that this article is equally useful to statisticians who are quite familiar with IRT and/or psychometrics but are new to a Bayesian analytic approach to IRT modeling. (Thus, we provide details on Bayesian computation.) The overall plan is to provide enough mathematics in both IRT model derivations and Bayesian computing so that they can be quickly deployed in practice. Muraki  provided a worked example on the generalized partial credit model (GPCM). The GPCM is among several commonly used models in analyzing items with polytomous response categories . This article is not about choosing one model out of alternative IRT models. Interested readers can find them elsewhere [20, 21]. The deviance information criterion by Spiegelhalter et al. , which is calculated as part of default output from R2WinBUGS, is useful in model selection. What we lack in breadth, we hope to compensate for in depth.
We organize this paper as follows. Section 2 covers the theories behind widely used IRT models, including the RM  for binary responses and the partial credit model (PCM)  and the GPCM  for polytomous item responses. In Section 3, we develop the GPCM model from a Bayesian perspective. We do not go into the details on how to manually carry out Gibbs sampling in IRT but provide a list of references for interested readers. Section 4 translates the GPCM model mathematics into WinBUGS syntax. We assume that the readers have gotten to the point of successfully installing r, the R2WinBUGS package in r, and WinBUGS on a computer platform of their choice. In Section 5, we illustrate on how to diagnose the convergence of iterative sampling. Sections 3–5 are the main focus of this paper. They cover, in detail, how to fit the GPCM using r and WinBUGS. Section 6 focuses on the practical aspects of item analysis, on how to decide which questionnaire items should be modified or deleted. Our overall pedagogical plan is to provide enough mathematical rigor on IRT so that readers can acquire a working knowledge of IRT without the need to review the vast psychometric literature spanning several decades. Finally, we discuss in Section 7 how these steps can be used to address the statistical considerations outlined in the FDA guidelines.
2 Models of item response data
2.1 Rasch model for dichotomized items
One of the simplest IRT models is the RM for dichotomized response data, developed by the Danish mathematician Georg Rasch . The RM handles data coded as ‘correct’/‘incorrect’ or ‘yes’/‘no’ with a value of 1 coding a correct answer or a ‘yes’ response. The log odds of answering an item correctly is a function of two parameters:
where Pr(yij = 1 | θi,βj) represents the probability of person i scoring a 1 vs. 0 on item j. The interpretation of this model is made clearer if we let θi represent person i's innate ‘ability’ and βj represent item j's ‘difficulty’. If a person's ability matches the difficulty of an item, then he or she has a 50/50 chance of answering the item correctly (assuming no guessing). This interpretation makes intuitive sense in an educational testing setting. Equation (1) can be unpacked by applying the inverse logit:
Equation (2) is more commonly used in the literature than the simpler Equation (3) .
At approximately the same time in the US, Birnbaum  and Lord  developed similar models independent of the RM [, Chapter 1]. Their models contain an additional slope parameter αj addressing the discriminating power of test items, thus the name two-parameter logistic model.
Note that the difference between person i's ‘ability’ level and item j's ‘difficulty’ level is weighted by αj. Thus, items with relatively large αj parameter values provide more precise measurement of θi, particularly for values of θi near the βj value. The αj parameter is interpreted as an index of how strong of an indicator item j is of the latent variable θ. We can view the classical RM in Equation (2) as a special case of the two-parameter logistic model in Equation (4) with fixed αj = 1 for all items.
2.2 Partial credit model and generalized partial credit model for polytomous items
The RM and two-parameter logistic models are not designed to handle multiple response categories or a mixture of yes/no and multiple responses. Masters  proposed a PCM to handle polytomous items by extending the dichotomous RM to K response categories. Master's PCM treats polytomous responses as ordered performance levels, assuming that the probability of selecting the kth category over the [k − 1]th category is governed by the dichotomous RM. For example, K = 4 if the responses are ‘Strongly Disagree’, ‘Disagree’, ‘Agree’, and ‘Strongly agree’ and are scored as 0, 1, 2, and 3, respectively. A person who chooses ‘Agree’ is considered to have chosen ‘Disagree’ over ‘Strongly disagree’ and ‘Agree’ over ‘Disagree’ but to have failed to choose ‘Strongly agree’ over ‘Agree’.
In the following discussion, we follow Masters' derivation of the PCM [, p. 158]. For each of the successive response categories, the probability of endorsing response category k over k − 1 for item j follows a conditional probability governed by the RM:
Thus, πjk = [(πj[k − 1] + πjk) exp(θi − βjk)]/[1 + exp(θi − βjk)]. We can solve for πj0,πj2, … ,πjk (detailed derivations are found in Appendix A). Muraki  shows that the G term in the following equation makes the solutions much easier to track.
These expressions can be interpreted intuitively as though the person ‘passes through’ each of the preceding response categories before finally stopping at a response [, p. 165] that, presumably, most accurately reflects that person's standing on the latent variable continuum. The adjacent βjk parameters represent the incremental item ‘difficulties’ that the person has to step through to reach the next response category. In educational testing, the incremental item difficulties assign credits to partially correct answers, and thus the name partial credit model. In HRQOL, symptom severity may be either assessed as ‘present’ or ‘absent’ or on a gradation such as ‘persistent/intermittent/none’. Each numerator in the PCM equation represents one unique response step, and the denominator G is the sum of all possible steps. In the psychometrics literature, models that conform to these characteristics are named ‘divide-by-total’ models .
The PCM can accommodate items with different response scales in one HRQOL instrument, such as a combination of items assessed as ‘present/absent’, ‘persistent/intermittent/none’, or on a four-category rating scale as described previously. To obtain a general notation, we define that item j is scored y = 0,1,2, … ,mj with Kj = mj + 1 response categories and the denominator G. A general model expression for Kj = mj + 1 that incorporates all of the aforementioned steps is given in the PCM:
where the numerator is the individual response outcomes and the denominator is the sum of all the possible outcomes; i = 1,2, … ,N refers to individual respondents, N refers to total number of respondents in the sample, j = 1,2, … ,J refers to items, and h = 1,2, … ,k refers to the number of response categories.
Muraki  further extended the PCM into the GPCM by introducing a discrimination parameter αj for each item.
The PCM is a special case of the GPCM where all αj = 1. The parameters θi, βjh, and αj are respectively interpreted as a person's underlying health status or quality of life, the inherent health status or quality of life intensity measured by the response category thresholds, and an item's ability in discriminating between persons with different underlying health status.
3 Generalized partial credit model from a Bayesian perspective
To further reduce notation, we denote the observed item response vector as y = (yi1,yi2, … ,yiJ), the vector of ability parameters as θ = (θ1,θ2, … ,θN), and the vector of parameters for the jth item as α = αj and β = βjh. The likelihood of observing response vector y is
where θ is a random sample from N(μ,σ2) [, Chapter 16], with hyperprior distributions and σ2 ∼ Inv-Gamma(a0,b0). Model (7) is not identified. For example, a constant can be added to the value of αj or βjh without affecting the model's prediction. There is a long tradition to set N(μ,σ2) to N(0,1) to simplify model fitting [5, 28, 29] and to address the issue of model identifiability. Another common approach to solve model indeterminacy is by centering the βjh parameters by their average so that they sum to zero [, p. 450]. These constraints are not always explicitly stated in the documentations, causing potential confusions in different estimated parameter values by different software packages.
Bafumi et al.  proposed an unconventional method to identify model parameters. Estimated parameters are normalized as follows after the estimation is complete. This method has the advantage that sample-specific distribution of θ can be derived for future references (e.g., difference between cohorts of respondents).
θi's are normalized to have mean 0 and standard deviation 1. The normalized θi's are named ,
βjk's are normalized by the mean and standard deviation of θi to retain a common scale for all parameters. The normalized β's are named , and
The multiplicative αj's are multiplied by the standard deviation of θi to retain a common scale, The normalized αj's are named ,
hence retaining (subscripts omitted to simplify notations).
3.1 Priors, hyperpriors, joint posterior density, and conditional posterior densities
The prior distributions are as follows:
where N and logN represent the normal and the log-normal distributions, respectively. Our prior knowledge about the unknown θ and β parameters is represented in relatively noninformative prior and hyperprior distributions. Baldwin et al.  previously used the log-normal distribution for α to represent common findings that α is non-negative and typically near 1.0. The distribution of the latent characteristics θ is not set to a standard normal. Instead, we set hyperpriors for μ and σ2 so that they center on the standard normal. The hyperprior for μ is noninformative. The hyperprior for σ2 follows an inverse gamma distribution with a mean of 1.0 and a variance of 2 (when shape = 0.5 and rate = 0.5), which places the bulk of parameter values within the typically observed [ − 3, + 3] bounds. It also minimizes the risk of floating point overflow during model estimation by WinBUGS.
The joint posterior distribution is as follows:
where α and β are assumed independent. The next steps are to derive the conditional posterior distributions for the θ, α, and β parameters from the joint posterior distribution. By way of an example, we begin the derivations with p(θ | y, α, β) = p(θ, α, β, y) / p(y, α, β) by the axioms of conditional probability [, p. 3]. However, this remains hard to track. Note that p(θ, α, β,y) = p(y | θ, α, β)p(θ, α, β). Thus, p(θ | y, α, β) = p(y | θ, α, β)p(θ, α, β) / p(y, α, β). If we assume a priori independence of θ, α, and β, we may write this as
because the marginal p(α) and p(β) distributions for fixed α and β are constants. Note that the conditional posterior is proportional to prior times likelihood of the data—the familiar form of Bayesian inference [, p. 32]. The conditional posterior distributions for the θ, α, and β parameters, respectively, are as follows:
Sampling from the conditional distributions in Equations (9)-(11) may be carried out using a rejection sampling with Gibbs approach, which is the method implemented in the WinBUGS software program  or a Metropolis–Hastings within Gibbs algorithm (e.g., [32, 33]). Readers interested in further technical details may follow-up with other published work (e.g., [20, 29, 32-34] and [, p. 444]). The next section provides a step-by-step guide for using r and WinBUGS to estimate the GPCM parameters.
4 Steps to fit a generalized partial credit model by r and WinBUGS
4.1 Neuroticism dataset
The dataset is the bfi dataset in the r package psych , which contains the responses of 2800 subjects to 25 personality self-report items. The items map onto five putative psychological dimensions, the ‘Big Five’ personality traits : agreeableness, conscientiousness, extraversion, neuroticism, and openness. This section focuses on the five items assessing neuroticism, a tendency to easily experience anger (‘1. Get angry easily’), unpleasant affect/emotion (‘2. Get irritated easily’ and ‘3. Have frequent mood swings’), depression (‘4. Often feel blue’), and anxiety (‘5. Panic easily’).
The response categories were ‘1. Very inaccurate’, ‘2. Moderately inaccurate’, ‘3. Slightly inaccurate’, ‘4. Slightly accurate’, ‘5. Moderately accurate’, and ‘6. Very accurate’. The bfi dataset was chosen because it is suitable to examine key aspects of IRT modeling. IRT makes inferences on an unobservable underlying psychological construct, in this case a neuroticism personality trait, on the basis of each person's responses to a few straightforward items. The response categories describe what the person is like generally and do not depend on environmental, social, and interpersonal contexts. The observed responses may be viewed as a self-reported symptom severity—a higher summation score representing more severe neuroticism. IRT goes beyond simple summary scores and takes into consideration, for each item, its symptom severity threshold (βjh) and sensitivity to changes in neuroticism (αj).
4.2 r syntax
We need two syntax files, one for r and one for WinBUGS. The r syntax file prepares the data for fitting the model in the WinBUGS syntax file. The r syntax is in Listing 1. Lines 1–3 load the required packages into r. Additional required packages are loaded automatically (e.g., packages coda is required by R2WinBUGS). Lines 6 and 7 extract only the neuroticism items. Line 9 shows that we use data from the first 500 subjects and convert them into a matrix. Lines 10–12 specify the number of subjects, the total number of items, and the response category for each item. Lines 13–18 set up the data and names of parameters to be fitted by WinBUGS. Then, bugs() is called to pass the data and the Gibbs sampler parameters to WinBUGS for analysis. Lines 24 and 31 calculate the elapsed time. The bugs() function needs to know the name of the WinBUGS syntax file and settings for the iterative chains. WinBUGS generates random initial values if the init option is set to NULL. Here, we set codaPkg = TRUE to save the iterative chains for further analyses by the coda package. Line 28 sets bugs.seed = 7 for reproducibility. The results are returned back to r as neuro.bugs.
A total of 152,000 iterative simulations were carried out per chain, the first 2000 iterations discarded, and 10,000 iterations saved after thinning. This long chain was guided by convergence diagnostics (described in Section 5). One advantage of using the R2WinBUGS package is that it automatically prints out two useful statistics for each parameter estimate: (1) the effective sample size (sample size adjusted for autocorrelation across simulations); and (2) the Gelman and Rubin convergence diagnostic , shown as Rhat. Values of Rhat close to 1 indicate convergence [39-41].
4.3 WinBUGS syntax
Listing 2 shows the WinBUGS syntax. Lines 1–15 are adapted from Curtis . Line 4 shows that the item responses can be one of K[j] possible values, with the probability of each response separately specified. Line 6 samples each person's latent characteristic from N(μ,σ2) with hyperpriors and σ2 ∼ Inv-Gamma(shape = 0.5,rate = 0.5), specified by lines 25 and 26, respectively.
Lines 11–13 calculate the numerators in the GPCM model in Equation (7). The denominator is harder to track. Recall that in Equation (5), the denominator . Thus, line 13 yields δ1 when k = 1, δ2δ1 when k = 2, and so on for when k = K[j]. Line 14 is the WinBUGS representation of Equation (7). Line 17 specifies prior α with a log-normal distribution with mean m.alpha and precision pr.alpha. Line 20 specifies prior β with a normal distribution with mean m.beta and precision pr.beta. Lines 23 and 24 calculate the precision parameters from the inverse of variance parameters.
Table 1 shows that the mean parameter estimates by WinBUGS agree well with those calculated by maximum likelihood estimation.
Table 1. Parameter estimates by WinBUGS and the gpcm() function in library(ltm).
W, WinBUGS parameter estimates; g, gpcm() function in library(ltm).
Figure 1 is an empirical quantile–quantile plot of the βjh estimates derived from the maximum likelihood method in gpcm() against the Bayesian estimates. The two distributions agree well, except that, for values outside [ − 1.5, + 1.5], the Bayesian estimates are distributed slightly closer to the mean than the maximum likelihood estimates. The small differences between the two sets of estimates disappear if we set θ ∼ N(0,1) (not shown), suggesting that the normalization scales the raw parameter values slightly toward μ.
5 MCMC diagnostics
Several diagnostics for the MCMC chains can be calculated by the coda package (or boa). The analyst is advised to consult several diagnostics because each has strengths and weaknesses, as discussed in review papers by Brooks and Robers  and Cowles and Carlin .
The menu-driven tool codamenu() calculates some of them, including the diagnostics by Gelman–Rubin , Geweke , Heidelberger–Welch , and the Raftery–Lewis  run-length estimate.
The Gelman–Rubin diagnostic requires parallel chains from dispersed initial values. The idea is to compare the between-chain and within-chain variabilities. If all chains converge to a similar posterior distribution, then the between-chain variability would be small relative to the within-chain variability. A ratio is calculated from a combined variability estimate (a weighted average of between-chain and within-chain variabilities) and the within variability alone. The square root of this ratio is the potential scale reduction factor; a longer simulation is indicated with a scale reduction value greater than 1.0. The Geweke diagnostic tests whether the early and latter iterative sequences (defaults are first 10% and the latter 50%) are comparable in their averages. A z-statistic is calculated, a large value (e.g., out of the ± 2 bounds) indicate the need for a longer chain.
The Heidelberger–Welch diagnostics include a stationary and a half-width test. The stationary test uses the Cramér–von Mises statistic to successively test the null hypothesis that the sampled values come from a covariance stationary process. The whole chain is first used to calculate the Cramér–von Mises statistic. If it passes the test (null hypothesis not rejected), then the whole chain is considered stationary. If it fails the test, then the initial 10% of the chain is dropped and tested again. Then, 20% is dropped and so on, until either the chain passes the test (e.g., at 30% reduction) or the remaining data are no longer sufficient (default to 50%).
The half-width test is then applied to the part of the chain that is deemed stationary. It tests whether or not the chain provides enough data to determine the confidence interval of the mean estimate to within a specific accuracy. The accuracy measure is the ratio of half of the width of the 95% confidence interval to the mean estimate. The default accuracy is 0.1 of the accuracy of the 95% confidence interval.
The Raftery–Lewis estimate focuses on achieving a prespecified precision of specific quantiles of a chain. The default is that the 2.5th percentile of a parameter estimate must be within a precision of 0.005 quantile units with 0.95 probability. The output reports the minimum length of the burn-in period (M), the estimated number of post burn-in iterations required to meet the criteria (N), the minimum number of post burn-in iterations (Nmin, assuming zero autocorrelation), and the dependence factor I, which is N/Nmin, the relative increase in total number of iterations due to autocorrelation. Jackman [, Section 6.2] observes a few problems with the Raftery–Lewis estimate. It can be conservative. It can run into a confusing and frustrating cycle—another, much longer run length is prescribed after the analyst has carried out a run length exactly as prescribed previously.
Our WinBUGS chains have generally exceeded the Raftery–Lewis burn-in length and post burn-in iterations, with the obvious exceptions of mu0 and var0. The estimated total run lengths are markedly reduced to 10,350 and 12,714, respectively, by lowering the default 0.025 quantile to 0.02 (Details of diagnostics available upon request). Moreover, mu0 and var0 pass the Gelman–Rubin, the Geweke, and the Heidelberger–Welch diagnostics. We suspect that the present run length is satisfactory after all, given Jackman's observation  ,Section 6.2 on the problems with the Raftery–Lewis, the passing of the three other diagnostics, and the markedly reduced run length estimates when the default quantile is slightly lowered.
6 Basic item analysis and selection
An item analysis helps determine the final length and response format of a new PRO instrument. The process can be rather detailed . The FDA guidance provides no specific recommendations on how to carry it out. An important goal, however, is to reduce the draft questionnaire to a shorter version with carefully crafted items that are selectively sensitive to a wide range of latent θ. This goal can often be adequately addressed by visual inspections of IRT model properties. Two such properties are covered in this section: (1) fitted response probability curves; and (2) Fisher information curves with respect to the latent θ. These basic visual explanations are often the first steps after estimating the IRT parameters and are part of popular introductory texts [1, 2, 48]. More rigorous methods are usually found in journal articles [49, 50].
6.1 Category response curves
Figure 2 shows an example of marginal posterior density curves for all items, across a range of θ values. The r commands used to plot the curves are available upon request.
These ‘category response curves’  are often used to evaluate the need for item and scale reduction. For example, there is a visible overlap between response categories 2, 3, and 4 for all items (‘2. Moderately inaccurate’, ‘3. Slightly inaccurate’, and ‘4. Slightly accurate’, respectively). The parameter estimates are also very similar in value (Table 1). These response categories may be merged into one in the next version of the instrument. Another noteworthy pattern is that the probability of endorsing item 4 ‘feeling blue’, an indication of depressive symptoms, is lower than in other items. Low endorsement probability is not necessarily a problem, because, in this case, we know that depression symptoms are not prevalent.
Generally, the category response curves would preferably cover a range of values on the latent characteristic. Low peaks indicate low probability of endorsement and relatively poorer αj discrimination parameters. They need to be further explored. For example, there is an approximately 1 in 5 probability of responding a 3 ‘Slightly inaccurate’ in all items for persons at average levels of neuroticism (i.e., at θ ≈ 0). Categories 2 and 4 work better than category 3 because they cover the same range with much higher probabilities. Therefore, by visually inspecting all such plots for all items, we may consider merging response categories, removing items with similar properties or items with visibly undesirable properties.
6.2 Item information curves
Another frequently asked question concerns how certain we are about a person's estimated level on the latent θ continuum. This question can be addressed by visually inspecting the Fisher information of an item over a range of θ values. A higher Fisher information means lower uncertainty for the θ estimate [1, 3] and vice versa.
Figure 3 plots the item information curves across a range of θ values for all items. Item 2 (‘irritated’) is visibly the most informative item. The next most informative items capture ‘anger’ and ‘mood swings’; and ‘anger’ tends to provide the most information at very intense levels of neuroticism (i.e., θ > 1.5). These three items do not appear to provide much information for θ values outside the [ − 1.5,2.5] range. New items may be needed if measuring very low levels of neuroticism is of interest. Items 4 and 5, on the other hand, share a similar low information profile. Scientifically, these items may need to be kept. From the perspective of item information, they do not seem to contribute much to the information about θ. Neither item wins over the other in pinpointing a person's underlying neuroticism. They may be revised or replaced if alternate items are available.
This five-item example shows that redundant items can be useful in a draft survey. Some seemingly redundant items may provide information over a range of the latent characteristic not covered by other theoretically more promising items. This example lacks items that are sensitive to latent characteristics below 1.5 standard deviations of the norm. Additional items would be necessary if detecting these extreme latent values were of interest. (It is hard to conceive of a need to diagnose extremely low neuroticism levels.)
Proprietary software programs generate the Fisher information curves. However, except for the simplest cases [, pp. 70–71], the computation formulae are not easily accessible to nonexperts. This can be seen in the somewhat technical reparameterization in Muraki's  derivations for the GPCM. These complications may be attributed in part to the long time since Birnbaum's [, Chapters 17 and 20] and Samejima's  pioneering works more than 40 years ago. There ought to be an accessible approach to the information function, one which is based on Fisher's original definition . Anyone who is comfortable with Fisher's original definition should be able to follow the computation easily. That is exactly what we have attempted to do in Appendix B. We aim to provide enough details to save the readers from plowing through other sources and yet not necessarily finding the needed answers [1-3, 5, 25, 52, 54, 55].
We provide an illustrative example in this article to show how to fit a GPCM model using r and WinBUGS. The primary pedagogical goal is to include sufficient practical analytic techniques in one guide so that readers unfamiliar with IRT can immediately apply these skills in their own work. Also covered are a few basics in Gibbs sampler and in diagnosing the convergence of iterative simulation by MCMC methods. The parameter estimates agree well with those calculated by the maximum likelihood method implemented in the ltm package. This leads us to encourage the further use of r and WinBUGS to reduce the reliance on proprietary computer software programs in IRT analysis.
We focus primarily on item analysis, which is among the most technically challenging steps towards claiming a PRO instrument reliable, valid, and responsive to changes in well-being (as per Section III.E of the FDA guidance). This tutorial does not cover practical aspects of carrying out analyses of reliability, validity, and responsiveness. However, these key concepts involve statistics that are comparatively more straightforward. For example, the concept of responsiveness can be tested by comparing the PRO scores of individuals whose symptoms have changed substantially . Reliability can be established by Cronbach's alpha statistic  and by correlating scores between repeated assessments. Validity can be established via ‘concurrent validity’ by correlating the scores of the target instrument and the scores derived from other instruments of the same or similar construct that have previously been validated. Another important concept is the ‘construct validity’, which is commonly established by a high degree of correlation between the target assessment and other assessments of similar construct (convergent validity) and the low correlation between the target assessment with other instruments of dissimilar construct (discriminant validity) .
Readers interested in other IRT models can find them in , who recently published a collection of WinBUGS code for many more IRT models, including the GPCM. However, Curtis did not provide a step-by-step link, as we have, between the mathematics and the WinBUGS code. The learner is encouraged to tackle the mathematics before plugging in the WinBUGS code—in our opinion, the recommended learning approach for biostatisticians who are already familiar with Bayesian statistics. A Bayesian approach to IRT has its own advantages. Baldwin et al.  recently reported in this journal that a Bayesian approach to IRT requires much smaller sample sizes than is required by conventional likelihood methods. The GPCM model in Equation (7) may appear complex, but a clear understanding of the mathematics reduces the model to three nested loops in the WinBUGS code. We believe that this resemblance between mathematics and computer syntax is an important advantage in preferring WinBUGS over special-purpose computer packages. It facilitates a deeper understanding of the IRT theory as well as the statistical computation. WinBUGS forces the learner to acquire a clearer understanding of the statistical model, which hopefully discourages indiscriminately applying existing data analysis ‘recipes’. Similarly, experienced biostatisticians who are already familiar with the r language no longer need to learn the sometimes eccentric computer programming languages of proprietary IRT software. It opens up new possibilities as well, as seen in analyses on gene expression data , explanatory IRT models , and recent psychometric work in cognitive models of IRT [59, 60].
Iterative simulation is time consuming. It is the main disadvantage against the Bayesian IRT approach. The time and effort diminish the enthusiasm for WinBUGS and its open-source version OpenBUGS in IRT analysis (also JAGS ). Analysts save time by using existing macros in their preferred statistical computer package, for example, the macros in sas (SAS Institute, Cary, NC, USA) [62-64] and stata (StataCorp, College Station, TX, USA) [65-67], and solutions based on sas NLMIXED procedure , if all they want are the parameter estimates and other statistics supported by these macros. Another limitation is the somewhat steep learning curve for beginners. It is compounded by practical complications. For example, problems in model convergence are hard to diagnose and rectify. Mistakes in WinBUGS syntax are hard to debug because error messages are often nonspecific. These are some of the reasons why beginners are intimated by WinBUGS. These limitations notwithstanding, we believe that circumstances will improve over time, with the publication of WinBUGS code collections and tutorials similar to this one. Psychometric models and methods in r are rapidly emerging. In 2007, the Journal of Statistical Software dedicated volume 20 to psychometric methods in r. Another collaborative effort is found in the online ‘Task View’ maintained by Mair and Hatzinger (http://cran.r-project.org/web/views/Psychometrics.html, last accessed July, 2011). New development efforts in the OpenBUGS program include improvements in the documentations as well as cross-platform compatibility. We are optimistic that, in time, as new and accessible knowledge bases accrue, r and WinBUGS will become the preferred tools for fitting IRT models.
APPENDIX A: Detailed derivations of the partial credit model
The G term in Equation (5) is crucial in understanding how the PCM equation is derived. However, a few algebraic steps were omitted in Masters' original derivations [, p. 158] and Muraki's  subsequent definition of the G term. They are restored here. These details are also helpful in tracking the model syntax in WinBUGS. We begin by restating the conditional probability of endorsing each successive response category k over k − 1 for item j:
In the aforementioned K = 4 example,
Thus, πj1 = [(πj0 + πj1) exp(θi − βj1)]/[1 + exp(θi − βj1)], which can be simplified as follows:
so πj1 = exp(θi − βj1) ⋅ πj0 because the exp(θi − βj1) ⋅ πj1 terms cancel out. To make the notations easier to track, we temporarily replace the somewhat cumbersome term of exp(θi − βjk) with δk. We have πj1 = δ1πj0. Similarly, πj2 = δ2πj1 and πj3 = δ3πj2. Because each person must choose one of the four possible responses for item j, πj0 + πj1 + πj2 + πj3 = 1. We can solve for πj0:
Let G = 1 + δ1 + δ2δ1 + δ3δ2δ1, we have
To obtain a general notation, we define that item j is scored y = 0,1,2, … ,mj with Kj = mj + 1 response categories and the denominator . This sum follows a highly regular pattern, which involves exp(0) in the first term, exp(0) × exp(θi − βj1) in the second term, exp(0) × exp(θi − βj1) × exp(θi − βj2) in the third term, and so on until the last term . Thus,
where for convenience. Finally, a general model expression for Kj = mj + 1 that incorporates all of the aforementioned steps is given in the PCM in Equation (6).
APPENDIX B: More on item information
This appendix covers the derivations of the item information function that assesses the amount of information that can be expected from an item. The derivations are pedagogically useful, especially for biostatisticians who are familiar with Fisher's original definition  but are not familiar with psychometrics. Samejima [, Chapter 6] was first to introduce Fisher's information into IRT in her 1969 monograph, according to Baker and Kim [, p. 222]. The detailed derivations in Samejima's work are no longer available in the literature (e.g., [, p. 222]; [, p. 374]; [, p. 208]; [, p. 308]). Here, we summarize our independent derivations starting from Fisher's definition.
On the basis of Fisher's original definition, the information function for the latent characteristic θ given data y is as follows:
where, for the GPCM, the log likelihood of the probability density function is defined in Equations (7) and (8). There is no need to include α and β into the log-likelihood function because α and β are fixed, which also makes the notations easier to track. Incidentally, the first derivative of the log likelihood, ∂log Pr(y | θ)/∂θ, is also known as the score of the probability density function. Hogg and Craig [, pp.373–374]) show that Equation (B1) is the same as the following:
and thus the following definition given in texts such as Lee [, p. 82], Berger [, Section 3.3.3], and Gelman et al. [, Section 2.9]:
the expectation being taken over all possible values of y for fixed item parameters α and β. To save notations further, we write p for the probability density function Pr(y | θ).
Muraki's  derivations for the item information function begin with Equation (B2). The integration is replaced with summation across mj discrete response categories for item j.
where the information of all possible response categories are weighted by their corresponding response probabilities. Up to this point, the definition of item information should be familiar to biostatisticians. However, an alternate but unfamiliar definition is used in the psychometric literature (e.g., [, Equation (2)]):
Lee [, pp. 82–83] provides a general proof that Equations (B4) and (B5) are indeed equivalent. However, the seemingly different equations may be a source of confusion and frustration in learning Fisher information in IRT. The intermediate derivations from Equations (B4) and (B5) are not found in recently published texts (e.g., [, p. 222]; [, p. 374]; [, p. 208]; [, p. 308]). The missing derivations are summarized next.
In Equation (B4), we need to find the second derivative of log p with respect to θ. We take the first derivative and obtain
by the general power rule in calculus. We then take the negative of the second derivative, , and follow the product rule to get the following:
By plugging in the previous result back into Equation (B4), we get
The second term of Equation (B6), , must sum to zero. It is easier to see why this must be the case using the four-category example mentioned previously. When θ is fixed, p0 + p1 + p2 + p3 = 1, because each person must choose one of the mj possible responses for item j. By taking the second partial derivatives with respect to θ on both sides of the equation, we get . The sum of the second partial derivatives of the response probabilities must be 0. Thus,