Abstract

  1. Top of page
  2. Abstract
  3. 1. Introduction
  4. 2. Response time modelling
  5. 3. The linear transformation model
  6. 4. Model estimation
  7. 5. Simulation study
  8. 6. Real data application
  9. 7. Discussion
  10. Acknowledgements
  11. References

The item response times (RTs) collected from computerized testing represent an underutilized source of information about items and examinees. In addition to knowing the examinees’ responses to each item, we can investigate the amount of time examinees spend on each item. In this paper, we propose a semi-parametric model for RTs, the linear transformation model with a latent speed covariate, which combines the flexibility of non-parametric modelling and the brevity as well as interpretability of parametric modelling. In this new model, the RTs, after some non-parametric monotone transformation, become a linear model with latent speed as covariate plus an error term. The distribution of the error term implicitly defines the relationship between the RT and examinees’ latent speeds; whereas the non-parametric transformation is able to describe various shapes of RT distributions. The linear transformation model represents a rich family of models that includes the Cox proportional hazards model, the Box–Cox normal model, and many other models as special cases. This new model is embedded in a hierarchical framework so that both RTs and responses are modelled simultaneously. A two-stage estimation method is proposed. In the first stage, the Markov chain Monte Carlo method is employed to estimate the parametric part of the model. In the second stage, an estimating equation method with a recursive algorithm is adopted to estimate the non-parametric transformation. Applicability of the new model is demonstrated with a simulation study and a real data application. Finally, methods to evaluate the model fit are suggested.


1. Introduction

Web-based assessment (i.e., online testing) is becoming a mainstream form of testing due to the internet’s flexibility, accessibility, and potential capacity for faster data analysis and reporting. It also makes the collection of examinees’ response times straightforward. Response times (RTs) provide a valuable source of information on examinees and test items. For instance, RTs can be used to evaluate the speededness (van der Linden, Breithaupt, Chuah, & Zhang, 2007) of the test, to improve the accuracy of examinees’ ability estimation, to detect cheating behaviours and to design better tests (e.g., Bridgeman & Cline, 2004; Chang, 2004; van der Linden & Guo, 2008; van der Linden, 2009). Several parametric models have been proposed for RTs from various tests. The models differ in terms of the assumed RT distribution (e.g., lognormal, exponential, Weibull), the underlying relation between ability and response speed, and the nature of items for which the model was designed (Schnipke & Scrams, 2002). Generally, the parametric models are concise and easy to interpret. However, with a new data set, one often needs to fit each parametric model separately until a best-fitting model is determined based upon some model diagnostic criterion (Schnipke & Scrams, 1997).

The idea of proposing a general model that includes various parametric models as submodels was first put forward by Ying and Chang (2005), and one example is the Box–Cox normal model (Klein Entink, van der Linden, & Fox, 2009b), in which a power parameter is introduced to represent a number of different transformations. Wang, Fan, Chang, and Douglas (2012) proposed a semi-parametric model for response times that considers continuous RTs and discrete item responses simultaneously through a two-level framework. In this paper, we propose an even more general model originating from the linear transformation model, which only assumes the existence of a monotone, but otherwise arbitrary transformation of the RTs such that the linear model holds. The semi-parametric nature of the model allows considerable generality and applicability but enough structure for useful substantive interpretation. In fact, by allowing the error term to take on different distributions, the linear transformation model includes the lognormal model, the Box–Cox model, the proportional hazards model and many other models as special cases. Due to its flexibility, this model has already been widely used in biostatistics to explore the effects of the covariates on (cancer) patients’ survival times. In those applications, however, the covariates are often observed, such as tumour type and measure of general fitness (Cheng, Wei, & Ying, 1995; Kalbfleisch & Prentice, 1973). Researchers later recognized the correlations among survival times that are due to either repeated measurements taken on a single subject, or measurements of a common variable taken on genetically associated subjects, and this gave rise to the development of frailty models, in which a latent frailty random variable is included in the model to account for possible correlations in survival time distributions (Clayton, 1991; Clayton & Cuzick, 1985). 
However, the standard frailty model only generalizes the proportional hazards model by incorporating a random effect, such that units within the same group (or response times for all test items within an individual) share the same frailty. Conditional on this frailty, the response times for the items within the same individual are assumed to be independent. Only recently have some researchers introduced the frailty term into the linear transformation model; for example, Mallick and Walker (2003) applied such a model to the Veterans Administration lung cancer trial data. Dunson (2003) proposed a slightly different model, called the ‘dynamic latent variable model’, for multidimensional longitudinal data. In that model, the dependent variables are assumed to follow a distribution in the exponential family whose canonical parameter, after a certain known monotone transformation, equals a linear combination of covariates (either observed or latent) plus an error term.

This paper is the first to introduce the linear transformation frailty model into RT analysis in psychological measurement, and it has two innovations. First, the linear transformation frailty model is placed as a first-level measurement model in a two-level model framework such that the response times and response accuracy are estimated simultaneously. Second, we propose a two-stage estimation method incorporating a rank-based marginal likelihood as a key building block. This method offers a way of estimating linear transformation models with latent covariates together with the population covariance matrix at the second level. The new method is also flexible enough to deal with sparse data, such as those often collected from computerized adaptive testing.

2. Response time modelling

Gulliksen (1950) was the first to make the distinction between power tests and speed tests. In a pure speed test, the items are easy and the examinees are asked to answer as many items as possible within a limited time period. The goal is to measure how quickly the examinees answer the items. In this sense, speed tests are limited to simple cognitive tasks. In a power test, the items differ in difficulty and there is no time limit. For these tests, response accuracy is of interest. In practice, although most tests (especially achievement tests) are power tests, they also contain a speed component in that they are administered with a certain time limit.

Klein Entink et al. (2009) summarized three different approaches that have been taken in the past to model RTs. The first approach models response times exclusively, so that it is mainly applicable to speed tests which have strict time limits. Such models include the exponential model (Scheiblechner, 1979), the Weibull model (Rouder, Sun, Speckman, Lu, & Zhou, 2003), and the gamma model (Maris, 1993), among others. The second approach focuses on separate analysis of RTs and response accuracy. For instance, Gorin (2005) regressed log-transformed RTs on decomposed item difficulty parameters. Similar ideas are seen in Embretson (1998) and Primi (2001). This approach assumes that RTs and responses vary independently, which might not be true in educational measurement. The third approach advocates joint modelling of both RTs and responses, and such models include those of Thissen (1983), van der Linden (1999), Roskam (1997), and Wang and Hanson (2005), to name but a few.

2.1. Joint response time and accuracy model

A major group of models in this category is motivated by the idea of a speed–accuracy relationship. Cognitive psychologists often focus on the within-person relationship, that is, whether a person’s response accuracy will decrease if he or she chooses to perform a task more quickly. This is termed the ‘speed–accuracy’ trade-off. Psychometricians, however, are more interested in the across-person relationship between speed and accuracy. For example, one question that psychometricians often explore is whether examinees with higher abilities tend to answer the items faster. Both types of speed–accuracy relationships are considered within the models suggested by Verhelst, Verstralen, and Jansen (1997) and Thissen (1983). In their models, the speed–accuracy trade-off is reflected in letting response accuracy depend on the time devoted to the item – spending more time on an item increases the probability of a correct response. The speed–accuracy correlation across examinees is reflected in the separate parameters of examinees’ abilities (or mental power) and speed. A particular joint model is that of Thissen (1983), which takes the form

$$\log T_{nj} = \mu + \tau_n + \beta_j - \rho(a_j\theta_n - b_j) + \varepsilon_{nj} \tag{1}$$

where $\varepsilon_{nj} \sim N(0, \sigma^2)$. The normally distributed error term indicates that the model belongs to the lognormal family. The parameters $\tau_n$ and $\beta_j$ can be interpreted as the speed of the examinee and the amount of time required by the item. The parameter $\mu$ is a general intercept parameter, and $a_j$, $b_j$ and $\theta_n$ are the item discrimination, item difficulty and examinee ability parameters, respectively. The term $\rho(a_j\theta_n - b_j)$ represents a regression of the logarithm of time on a two-parameter response model, with $\rho$ being the regression parameter. The speed–accuracy trade-off is indicated by the term $\rho(a_j\theta_n - b_j)$ when $\rho < 0$. When $\rho > 0$, the speed–accuracy relation reverses.

2.2. Hierarchical model

Van der Linden (2007) and Klein Entink et al. (2009) argued that the speed–accuracy trade-off is a subject-specific phenomenon in that each individual may go through a unique mechanism to balance between speed and accuracy. If one chooses to work with a certain speed then this implies a level of accuracy. On a test with a reasonable time limit, there is no need to incorporate such a personal-level trade-off in an RT model for a fixed person and a fixed set of test items. In other words, the speed at which the test taker operates on the items should be assumed to be a latent trait, and the variability in response times can be explained given the latent speed level. This assumption has been supported by a number of researchers. Kennedy (1930) found that individuals tend to perform at a constant rate of work across a variety of cognitive tasks, even after adjusting for differences in intelligence. This conclusion was supported by Tate (1948), who investigated the speed–accuracy relationship on number series, arithmetic reasoning, and spatial relations questions. He found that when accuracy is controlled, the fastest examinees are not the most accurate, but fast subjects are consistently fast and slow subjects are consistently slow. These results imply that we need to model accuracy dependent on ability, and response time dependent on examinees’ latent speeds. But at the higher level, the speed and ability may be correlated. In this way, the response accuracy is influenced by the speed of working only through the second-level correlation between speed and ability. The level of correlation may differ depending upon the test context and content (Schnipke & Scrams, 2002).

Following this argument, van der Linden (2007) proposed a hierarchical framework in which RTs and responses are modelled separately at the measurement model level; and at a higher level, a population model for the person parameters (speed and ability) is constructed to account for the correlation between them. This model distinguishes the speed–accuracy trade-off within a person from the speed–accuracy correlation in the population. Klein Entink et al. (2009b) proposed a multivariate multilevel model for mixed response variables (binary response accuracy and continuous RTs). Their model allows for the incorporation of explanatory variables to identify factors that explain variation in speed and accuracy between individuals who may be nested within groups. Another attempt was made by Klein Entink, Kuhn, Hornke, and Fox (2009a). They proposed a joint modelling approach using response accuracy and RTs to evaluate cognitive theory. Their model has a similar structure to van der Linden’s (2007) hierarchical model, and the innovation lies in the decomposition of each item parameter based on the detailed cognitive process required by the item, in order to validate the cognitive theory that underlies the item design (Klein Entink et al., 2009a).

The measurement models for response times in the papers mentioned above are all parametric models; they either utilize the lognormal model or the Box–Cox normal model. However, a recent paper by Ranger and Kuhn (2012) demonstrated that the RT distribution differed dramatically across items within one test, which calls for a flexible model that relaxes such distributional assumptions. In the present paper, we replace the parametric RT model with the semi-parametric linear transformation model in the two-level framework.

3. The linear transformation model

The independent random variables, T1, …, Tn, are said to follow a linear transformation model if, for some increasing transformation H,

$$H(T_i) = \beta Z_i + \varepsilon_i, \quad i = 1, \ldots, n, \tag{2}$$

where $Z_i$ is an observed covariate (it can also be a vector), $\beta$ is the regression parameter, and $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed with distribution F. This model indicates that, after some order-preserving transformation, the dependent variable is related to Z in a simple linear fashion except for random errors (Cuzick, 1988). Parametric forms for H have been studied extensively in the literature; some examples are (1) $H(t) = (t + c)^{\lambda}$, (2) $H(t) = \mathrm{sign}(t)\,|t|^{\lambda}$, and (3) $H(t) = (t^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$ and $H(t) = \log t$ for $\lambda = 0$ (Box & Cox, 1964). Here t is a realization of the random variable T. The third example is the well-known Box–Cox power transformation. When the parametric family of H is not specified, equation (2) becomes a non-parametric model where H is continuous and monotone, but otherwise arbitrary. Linear transformation models represent a rich family of different models. For example, if $\varepsilon$ follows the standard logistic distribution, then (2) reduces to the proportional odds model (Bennett, 1983; Pettitt, 1982); if $\varepsilon$ follows the standard normal distribution, then (2) becomes a semi-parametric extension of the Box–Cox model (Doksum, 1987).
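As a small illustration of the parametric family in example (3), the following Python sketch evaluates the Box–Cox transformation and checks that it is increasing for several values of λ, which is the property the linear transformation model requires of H. The function name `box_cox` and the sample times are illustrative choices, not part of the original study.

```python
import math

def box_cox(t, lam):
    """Box-Cox power transformation H(t) for a positive time t."""
    if t <= 0:
        raise ValueError("t must be positive")
    if lam == 0:
        return math.log(t)
    return (t ** lam - 1.0) / lam

times = [0.5, 1.0, 2.0, 4.0, 8.0]  # illustrative RTs
for lam in (1.0, 0.5, 0.0, -0.5):
    h = [box_cox(t, lam) for t in times]
    # H is strictly increasing, so it preserves the ordering of the RTs
    assert all(x < y for x, y in zip(h, h[1:]))
```

Because every member of the family is order-preserving, the ranks of the RTs are the same whichever λ is used; only the shape of the transformed distribution changes.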

A common way of viewing the linear transformation model as a generalized model is from the perspective of the flexible link function. The link function represents the underlying relationship between the explanatory covariates and the observed response time. The hazard function, normally denoted by h(t), is the instantaneous rate at which response times occur. Models typically consider the direct effect of covariates on the hazard function, rather than the survival function. Often the effects of the covariates (Z) are reflected via a linear function, Z′β, and therefore

$$h(t \mid Z) = h_0(t)\,c(Z'\beta) \tag{3}$$

Here $\beta' = (\beta_1, \ldots, \beta_p)$ are regression parameters and c is a specified functional form, also known as the link function; $h_0(t)$ is the baseline hazard, which can take on either parametric or non-parametric forms. Three common forms for c are (1) $c(s) = 1 + s$, (2) $c(s) = (1 + s)^{-1}$, and (3) $c(s) = \exp(s)$. The first two forms correspond to (1) the hazard rate and (2) the mean survival time being linear functions of Z. The last form, which is also the most widely used, assumes that a unit increase in a covariate acts multiplicatively on the hazard rate, and is used in the Cox proportional hazards model.

Two widely used parametric regression models are the exponential model and the Weibull model (Rouder et al., 2003). In the exponential model, the hazard function takes the form $h(t \mid Z = z) = \lambda\exp(z'\beta)$. Taking logarithms on both sides shows that the log hazard rate is a linear function of the covariate Z. In terms of the log response time, $Y = \log T$, the above model can be reparameterized as $Y = \alpha - Z'\beta + W$, where $\alpha = -\log\lambda$ and W follows the extreme value distribution. Hence the exponential model can be viewed as a log-linear model, that is, a linear model for Y whose error term W has an extreme value distribution. The Weibull model can be reparameterized in the same fashion. Recall that the conditional hazard in the Weibull model takes the form $h(t \mid Z = z) = \lambda\gamma(\lambda t)^{\gamma - 1}\exp(z'\beta)$. Because of the exponential link function ($c(s) = \exp(s)$), the covariates again act multiplicatively on the Weibull hazard. Letting $Y = \log T$, the Weibull model can be expressed in the linear form $Y = \alpha + Z'\beta^{*} + \sigma W$, where $\alpha = -\log\lambda$, $\sigma = \gamma^{-1}$ and $\beta^{*} = -\sigma\beta$; W again follows a standard extreme value distribution.
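The log-linear form of the Weibull model can be checked numerically. The sketch below assumes the hazard is parameterized as λγ(λt)^(γ−1) exp(zβ) with a scalar covariate, draws RTs by inverting the survival function, and verifies that log T equals α + zβ* + σW draw by draw. The parameter values and the helper name `weibull_rt` are hypothetical.

```python
import math
import random

def weibull_rt(z, lam, gamma, beta, e):
    """Invert S(t | z) = exp(-(lam * t)**gamma * exp(z * beta)) at e ~ Exp(1)."""
    return (e * math.exp(-z * beta)) ** (1.0 / gamma) / lam

rng = random.Random(0)
lam, gamma, beta = 0.05, 1.8, 0.6           # illustrative values
alpha, sigma = -math.log(lam), 1.0 / gamma
beta_star = -sigma * beta

max_gap, w_sum, n = 0.0, 0.0, 20000
for _ in range(n):
    z = rng.gauss(0.0, 1.0)                 # a scalar covariate
    e = rng.expovariate(1.0)                # unit exponential draw
    w = math.log(e)                         # standard extreme value error
    y = math.log(weibull_rt(z, lam, gamma, beta, e))
    max_gap = max(max_gap, abs(y - (alpha + z * beta_star + sigma * w)))
    w_sum += w
w_mean = w_sum / n                          # close to minus Euler's constant
```

The error W here is the logarithm of a unit exponential variable, which is one standard way of generating the extreme value distribution referred to in the text.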

Another widely used parametric regression model comes from the lognormal distribution of time (van Breukelen, 2005; van der Linden, 2007). The lognormal regression model generally takes the form

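$$\log T = \alpha + Z'\beta + \sigma W,$$

where $W \sim N(0, 1)$. Then the hazard function can be written as

$$h(t \mid Z = z) = \frac{\phi\{(\log t - \alpha - z'\beta)/\sigma\}}{\sigma t\,[1 - \Phi\{(\log t - \alpha - z'\beta)/\sigma\}]},$$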

where φ(·) and Φ(·) are the probability density function and cumulative distribution function of the standard normal distribution, respectively. Evidently, the covariate is not multiplicatively related to the hazard function, and the effect of the covariates cannot be expressed through an exponential link function as in the exponential or Weibull regression models. For these reasons, the lognormal regression model is not a special case of the proportional hazards model. However, it does belong to the linear transformation model family. Therefore, the linear transformation model allows for a great deal of flexibility in representing RT distributions and in modelling the relationship between covariates and RTs, while still preserving a large degree of parsimony in describing the effect of item/examinee covariates on RTs.
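The lack of proportionality can be seen numerically. The following sketch assumes the lognormal model is parameterized as log T = α + zβ + σW with W standard normal and a scalar covariate, and compares the hazard ratio between z = 1 and z = 0 at several times with the corresponding ratio under a Weibull hazard. All parameter values and function names are illustrative.

```python
import math

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_sf(x):
    """Upper tail probability 1 - Phi(x) of the standard normal."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def lognormal_hazard(t, z, alpha=0.0, beta=0.5, sigma=1.0):
    u = (math.log(t) - alpha - z * beta) / sigma
    return norm_pdf(u) / (sigma * t * norm_sf(u))

def weibull_hazard(t, z, lam=1.0, gamma=1.5, beta=0.5):
    return lam * gamma * (lam * t) ** (gamma - 1.0) * math.exp(z * beta)

times = [0.5, 1.0, 5.0]
# Hazard ratio for z = 1 versus z = 0 at each time point
lognormal_ratios = [lognormal_hazard(t, 1) / lognormal_hazard(t, 0) for t in times]
weibull_ratios = [weibull_hazard(t, 1) / weibull_hazard(t, 0) for t in times]
```

Under the Weibull hazard the ratio is exp(β) at every time point, whereas under the lognormal model it changes with t, so no single multiplicative covariate effect on the hazard exists.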

3.1. Hierarchical linear transformation IRT model

In this paper, we adopt van der Linden’s (2007) hierarchical framework while replacing the lognormal model with the linear transformation model for response times. In particular, the model we propose is as follows.

3.1.1. First-level model

At the first level, two models for the responses and RTs are specified separately. For the item response model, any appropriate parametric model may be used, but we focus on the three-parameter logistic model:

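$$P_j(\theta_i) = c_j + (1 - c_j)\,\frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}} \tag{4}$$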

with aj, bj, and cj representing item discrimination, difficulty and guessing parameters, respectively. For the RTs, a linear transformation model is adopted:

$$H_j(t_{ij}) = \beta_j\tau_i + \varepsilon_{ij} \tag{5}$$

where $\tau_i$ is the speed parameter for test taker i. $H_j$ represents the monotone transformation for item j, and this item-level transformation implies that different types of RT transformations will be possible for different items. $\beta_j$ is a discrimination-like parameter. A negative $\beta_j$ means that examinees with higher speed will tend to have shorter RTs. The $\varepsilon_{ij}$ are independent and identically distributed with distribution F and independent of the $\tau_i$. Because the $\tau_i$ are latent variables, they are sometimes referred to as frailties. We assume F is known and the same across different items. The three distributions we consider in the simulations are normal, extreme value and logistic.
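To make the first-level model concrete, the sketch below simulates one examinee's responses and RTs. It assumes the RT model has the form H_j(t) = β_j τ + ε and, purely for illustration, takes H_j(t) = log(λ_j t) so that the transformation can be inverted in closed form; in the semi-parametric model H_j is left unspecified. The item values, the function names and the tuple layout are hypothetical.

```python
import math
import random

def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def draw_error(rng, dist):
    """Draw an error from one of the three distributions named in the text."""
    u = rng.uniform(1e-12, 1.0 - 1e-12)
    if dist == "normal":
        return rng.gauss(0.0, 1.0)
    if dist == "logistic":
        return math.log(u / (1.0 - u))
    if dist == "extreme":
        return math.log(-math.log(1.0 - u))
    raise ValueError(dist)

def rt_given_error(tau, beta_j, lam_j, eps):
    """Solve log(lam_j * t) = beta_j * tau + eps for t."""
    return math.exp(beta_j * tau + eps) / lam_j

def simulate_examinee(rng, theta, tau, items, dist="normal"):
    out = []
    for a, b, c, beta_j, lam_j in items:
        correct = rng.random() < p_3pl(theta, a, b, c)
        rt = rt_given_error(tau, beta_j, lam_j, draw_error(rng, dist))
        out.append((int(correct), rt))
    return out

rng = random.Random(42)
# (a_j, b_j, c_j, beta_j, lam_j) for two hypothetical items
items = [(1.2, 0.0, 0.2, -0.8, 0.02), (0.9, 1.0, 0.25, -0.5, 0.01)]
data = simulate_examinee(rng, theta=0.3, tau=0.5, items=items, dist="logistic")
```

With a negative β_j, increasing τ while holding the error fixed shortens the simulated RT, which matches the interpretation of β_j given above.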

3.1.2. Second-level model

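Similar to the van der Linden (2007) model, this level captures the joint distribution of the person parameters in a population. The values of $(\theta_i, \tau_i)$ are assumed to be randomly drawn from a multivariate normal distribution, that is,

$$(\theta_i, \tau_i)' \sim \mathrm{MVN}(\mu, \Sigma) \tag{6}$$

with mean vector

$$\mu = (\mu_\theta, \mu_\tau)'$$

and covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_\theta^2 & \rho_{\theta\tau}\sigma_\theta\sigma_\tau \\ \rho_{\theta\tau}\sigma_\theta\sigma_\tau & \sigma_\tau^2 \end{pmatrix}.$$
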
3.1.3. Identifiability

To establish identifiability, we suggest the constraints $\mu_\tau = 0$ and $\sigma_\tau^2 = 1$. The mean of τ is fixed to fix the centre of the non-parametric transformation H, and the variance of τ is fixed to remove the trade-off between $\beta_j$ and $\tau_i$. The scale of H is determined by the fixed distribution of the error term $\varepsilon_{ij}$. For instance, when the error term follows a normal distribution, $\varepsilon_{ij} \sim N(0, \sigma^2)$, we restrict it to be standard normal with $\sigma^2 = 1$. This is because σ can easily be absorbed into the monotone transformation and the item parameter $\beta_j$: dividing both sides of equation (5) by σ gives $H_j(t_{ij})/\sigma = (\beta_j/\sigma)\tau_i + \varepsilon_{ij}/\sigma$, which is equivalent to the original equation (5) with ε now from the standard normal distribution. The mean and variance for θ are free to estimate in this case because we assume the item parameters (including discrimination, difficulty and guessing parameters) have already been well calibrated. This is often the case in item banks for which response times are recorded and IRT parameters have been calibrated, but no response time model has been calibrated.
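The second-level model and the role of the constraints can be illustrated with a short simulation. The sketch assumes the constraints fix the mean of τ at 0 and its variance at 1, and draws (θ, τ) through the usual two-step construction of a bivariate normal pair. The values of μ_θ, σ_θ and the correlation are hypothetical.

```python
import math
import random

def draw_person(rng, mu_theta, sigma_theta, rho):
    """Draw (theta, tau) from a bivariate normal with E(tau) = 0, Var(tau) = 1."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    tau = z1
    theta = mu_theta + sigma_theta * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return theta, tau

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(7)
people = [draw_person(rng, 0.2, 1.5, 0.4) for _ in range(20000)]
thetas = [p[0] for p in people]
taus = [p[1] for p in people]
tau_mean = sum(taus) / len(taus)
tau_var = sum((t - tau_mean) ** 2 for t in taus) / len(taus)
r = corr(thetas, taus)

# Without the variance constraint, (beta_j, tau_i) and (k * beta_j, tau_i / k)
# give the same linear predictor, so the two could not be separated.
beta_j, tau_i, k = -0.8, 0.5, 3.0
same_predictor = math.isclose(beta_j * tau_i, (k * beta_j) * (tau_i / k))
```

The last lines show the trade-off that the variance constraint removes: rescaling τ by any constant can be offset exactly by rescaling β_j.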

3.1.4. Assumption

Both the response model and the response time model described above rely on the assumption of local independence. The responses are locally independent given the examinee’s latent ability θ; the RTs are locally independent given the examinee’s latent speed τ; and the responses and response times are independent given θ and τ. The latent variables θ and τ are normally distributed, and the distribution of the error term, $F_\varepsilon$, is fixed and known. Unlike in van der Linden (2007), the item parameters for the three-parameter logistic model are known in our case; thus we do not estimate the correlation structure of the item parameters.

To show that this new model is a generalization of van der Linden’s (2007) model, recall that in the latter the response time, after log transformation, follows the normal distribution

$$\log T_{ij} \sim N(\delta_j - \tau_i,\ \alpha_j^{-2}) \tag{7}$$

Here $\tau_i$ is the speed parameter for examinee i, and $\delta_j$ and $\alpha_j$ are the time intensity and discriminating power, respectively, of item j. Similar to the difficulty and discrimination parameters in the three-parameter logistic model, a higher value of $\delta_j$ indicates that the item requires a longer time to finish, and a higher $\alpha_j$ indicates higher power to differentiate slow examinees from fast ones. Our linear transformation model takes a very similar form. Suppose that $H_j$ is a log transformation, and that in order to allow different items to have distinct transformations we embed a scale term, $H_j(t) = \log(\lambda_j t)$; then the linear transformation model is expressed as $\log(\lambda_j t) = \beta_j\tau_i + \varepsilon_{ij}$, and we can further rewrite it as $\log t = -\log\lambda_j + \beta_j\tau_i + \varepsilon_{ij}$, which implies that $\log t \sim N(-\log\lambda_j + \beta_j\tau_i, 1)$ when $\varepsilon_{ij}$ is standard normal. The item time intensity parameter, in this case, is represented by $-\log\lambda_j$.

4. Model estimation

Several attempts have been made by researchers to estimate the linear transformation model with or without frailty terms. When the covariates are completely observed, Chen, Jin, and Ying (2002) proposed an estimating equation method in which two separate estimating equations are constructed to estimate β and H, respectively. Their estimator for β reduces to the Cox partial likelihood estimator when the error term follows the extreme value distribution. It is easy to compute the estimator through the estimating equation which resembles the Cox partial likelihood score function; and the estimator also has nice asymptotic properties such as closed-form variance and asymptotic normality. When there are latent covariates (or frailty) terms, Mallick and Walker (2003) proposed a fully Bayesian Markov chain Monte Carlo (MCMC) method that is able to estimate both the parametric regression parameter β and the non-parametric transformation H within separate parallel Markov chains. Their method is even more flexible in that the distribution of the error term Fɛ is also unknown and free to estimate. Specifically, they propose to use a mixture of incomplete beta functions for H(·) and model Fɛ as a Pólya tree distribution. Dunson (2003) also employed a Bayesian MCMC estimation method with nicely specified full conditional distributions for each parameter, but his method is dependent on the fixed and known transformation H.

The goal of our investigation is to estimate accurately the parameters $\theta_i$, $\tau_i$ (i = 1, 2, …, N), $\beta_j$ (j = 1, 2, …, J), $\mu_\theta$, $\sigma_\theta^2$ and $\rho_{\theta\tau}$, as well as the non-parametric transformations $H_j$. In particular, we would like our estimation technique to allow for data obtained by computerized adaptive testing, in which every test taker is given different items, based on his or her adaptively estimated θ level. Hence the random sampling of θ (or τ) from a common distribution cannot be assumed, and the usual marginal likelihood approaches used in latent variable modelling are no longer appropriate. Also, because we have a latent frailty term and an unknown transformation H(·), the methods of Chen et al. (2002) and Dunson (2003) do not lend themselves directly to our model estimation. Mallick and Walker’s (2003) method seems promising, but our preliminary investigation showed that using mixtures of incomplete beta functions for H(·) involves two parts that need to be adjusted for every data set to produce accurate estimates. In addition, mixtures of beta functions are not always enough to approximate various shapes of the non-parametric transformation. In this paper, we propose a two-stage estimation method. In the first stage, we focus on the parametric part of the model, and rely only on the ranks of the observations instead of the observations per se, so as to avoid the complications introduced by H(·). Specifically, we propose to use the ‘rank-based likelihood’ coupled with the MCMC method for parameter estimation. In the second stage, we use the estimating equation method proposed in Chen et al. (2002) for estimating H(·) while treating the parametric part of the model as known.

4.1. Rank-based marginal likelihood for β

In model (5), for fixed $\beta_j$, a maximal invariant for τ under the group of monotone transformations is the vector

$$(\tau_{(1)}, \ldots, \tau_{(N)}),$$

where $t_{(1)} < \cdots < t_{(N)}$ are the ordered $t_i$ and $\tau_{(i)}$ is the corresponding covariate, so that $(\tau_{(i)}, t_{(i)})$, i = 1, …, N, is a permutation of $(\tau_i, t_i)$. Knowing this vector is equivalent to knowing the ranks of the $t_i$ together with $\tau_1, \ldots, \tau_N$. Therefore, it is reasonable to use the marginal likelihood of this vector, which does not depend on $H_j$, to make inference about $\beta_j$ without any loss of information (Bickel & Ritov, 1998).

Denote the density of $\varepsilon_{ij}$ in model (5) by $f_\varepsilon(\cdot)$. Consider the group G of increasing differentiable transformations acting on t. If $H \in G$, the density function of t is

$$f_\varepsilon\{H(t) - \beta_j\tau_i\}\,H'(t). \tag{8}$$

Because the inference is based only on the ranks of the ti, the general location of the ti cannot be estimated, and a constant term in the linear model is not needed. To fix the scale, it is assumed that inline image, which resonates with the model assumption described earlier. By the definition of Barnard (1963), the rank vector is marginally sufficient for β and inferences on β can be based on the marginal likelihood generated by the probability function of rank statistics inline image as

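$$L(\beta) = \int\!\cdots\!\int_{x_1 < \cdots < x_N} \prod_{i=1}^{N} f_\varepsilon(x_i - \beta\tau_{\alpha_i})\,dx_i, \tag{9}$$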

where $\alpha_i$ is the anti-rank of $t_j$, that is, $\alpha_i = j$ if and only if $t_j$ is the ith smallest of $t_1, \ldots, t_N$ (Cuzick, 1988; Pettitt, 1982). For most choices of the density $f_\varepsilon(\cdot)$, the integral in (9) is not tractable, but the extreme value density, which yields the Cox proportional hazards model, is an exception. When $f_\varepsilon(x) = \exp(x - e^x)$, the integral in (9) becomes precisely the partial likelihood of β as long as the covariates are time-invariant (Kalbfleisch & Prentice, 1973; Kalbfleisch, 1978). For other densities $f_\varepsilon(\cdot)$, the integral in (9) can only be obtained through approximation. An approximation is given by Pettitt (1982) by means of Taylor series:
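For the extreme value case the rank probability has a closed form, which the sketch below evaluates and then checks by simulation. It assumes the RT model has the form H(t) = βτ + ε, under which the Cox risk score of an examinee is exp(−βτ); the function names and the three data points are hypothetical.

```python
import math
import random

def rank_likelihood_extreme(times, tau, beta):
    """Probability of the observed RT ordering under extreme value errors."""
    order = sorted(range(len(times)), key=lambda i: times[i])  # anti-ranks
    risk = [math.exp(-beta * tau[i]) for i in order]
    lik = 1.0
    for i in range(len(order)):
        lik *= risk[i] / sum(risk[i:])   # partial likelihood factor
    return lik

def mc_rank_probability(times, tau, beta, n_draws, seed=0):
    """Monte Carlo frequency of the same ordering, simulated on the H scale."""
    rng = random.Random(seed)
    target = sorted(range(len(times)), key=lambda i: times[i])
    hits = 0
    for _ in range(n_draws):
        # H(t_i) = beta * tau_i + eps_i, with eps_i the log of a unit exponential
        x = [beta * t + math.log(rng.expovariate(1.0)) for t in tau]
        if sorted(range(len(x)), key=lambda i: x[i]) == target:
            hits += 1
    return hits / n_draws

times = [3.1, 1.4, 7.9]          # illustrative RTs for three examinees
tau = [0.5, 1.2, -0.7]           # their (known) speeds
beta = -0.8
exact = rank_likelihood_extreme(times, tau, beta)
approx = mc_rank_probability(times, tau, beta, n_draws=40000)
```

Because only the ordering of the RTs enters, replacing `times` by any increasing transformation of it leaves `exact` unchanged, which is the invariance that motivates the rank-based likelihood.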

  • image(10)

Here inline image and ζ=βτi. The key is to obtain approximation for equation (10). Pettitt (1982) provided a detailed derivation for approximating equation (10) by

  • image(11)

where the matrix C and vector a have an explicit analytical form for some specific densities fɛ(·), such as the normal, logistic and double exponential distributions (Pettitt, 1982). The quality of the approximation worsens as the density fɛ(·) departs from normality (Pettitt, 1983).

When inline image, let inline image be the order statistics of a sample of size N from the standard normal distribution. Then a=E(Z) with inline image, and inline image where inline image. Details of the derivation are given in Pettitt (1982). Thus, if ξk is the mean of the kth order statistic in a random sample of standard normal distribution with size N, and ri is the rank of the response time from the ith examinee, then inline image. A is the variance–covariance matrix for the normal order statistics inline image from the observed response times. So if ξik is the (i, k)th element in the variance–covariance matrix for the standard normal order statistics, then inline image where ri and rk are the ranks for the ith and kth observations. Because the response times are continuous, we assume that the ranks are uniquely assigned to each observation without ties. But when there are ties, it is easier to break them by adding a very small number to one of the tied observations. More elegantly, the rank-based likelihood can be adjusted slightly, as described in Pettitt (1982). The main idea is to find the marginal and joint density of the tied observations as a uniform mixture of the density of the distinct order statistics when no ties are assumed. This treatment is justified by the ‘equiprobable’ assumption discussed in Bradley (1968).

When the error term follows the logistic distribution, $f(y) = e^{y}/(1 + e^{y})^{2}$, which results in the proportional odds model, the approximation to (9) is given as follows. The general form of the approximation in (11) stays the same, but

  • image

and C=BA, where B is a diagonal matrix (Pettitt, 1982). The marginal rank-based likelihood will be used for both β and θ estimation in the Metropolis–Hastings algorithm.
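As a rough illustration of how a rank-based likelihood can sit inside a Metropolis–Hastings step, the sketch below updates a single β by a random-walk proposal. It is a simplification under stated assumptions, not the estimation procedure of this paper: the error is taken to be extreme value so that the rank likelihood is the exact partial likelihood, the speeds τ are treated as known, the RT model is assumed to have the form H(t) = βτ + ε, and the prior is the N(0, 10) prior described in Section 4.3.1. The step size, chain length and simulated data are hypothetical.

```python
import math
import random

def log_partial_likelihood(order, tau, beta):
    """Log rank likelihood under extreme value errors (risk score exp(-beta*tau))."""
    scores = [-beta * tau[i] for i in order]
    total, tail = 0.0, 0.0
    # Walk from the largest RT to the smallest, accumulating the risk-set sum.
    for s in reversed(scores):
        tail += math.exp(s)
        total += s - math.log(tail)
    return total

def log_prior(beta, var=10.0):
    return -0.5 * beta * beta / var

def mh_beta(times, tau, n_iter=3000, step=0.15, seed=0):
    rng = random.Random(seed)
    order = sorted(range(len(times)), key=lambda i: times[i])
    beta = 0.0
    cur = log_partial_likelihood(order, tau, beta) + log_prior(beta)
    chain = []
    for _ in range(n_iter):
        prop = beta + rng.gauss(0.0, step)
        new = log_partial_likelihood(order, tau, prop) + log_prior(prop)
        if rng.random() < math.exp(min(0.0, new - cur)):   # accept or reject
            beta, cur = prop, new
        chain.append(beta)
    return chain

# Simulated data: only the ranks matter, so any increasing H gives the same chain.
rng = random.Random(3)
N, true_beta = 300, -1.0
tau = [rng.gauss(0.0, 1.0) for _ in range(N)]
times = [math.exp(true_beta * t) * rng.expovariate(1.0) for t in tau]
chain = mh_beta(times, tau)
post_mean = sum(chain[1000:]) / len(chain[1000:])
```

In a full sampler the τ values would themselves be updated in turn, and for normal or logistic errors the exact partial likelihood would be replaced by the approximation discussed above.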

4.2. Estimating equation method for H

We use the estimating function developed by Chen et al. (2002) for estimating the non-parametric monotone transformation H(t). Let λ(·), Λ(·) be the known hazard and cumulative hazard functions of ɛ, respectively. Let Y(t) =I(Tt), N(t) =I(Tt) and let {Yi(t), Ni(t)} be the corresponding samples of {Y(t), N(t)}. For a single item, suppose there are N examinees answering that item. Then the estimating equation for H(t) is

  • image(12)

assuming inline image, i = 1, …, N, are the estimates in the first step. The solution inline image to (12) is the estimate of the unknown monotonic transformation H. Chen et al. (2002) further proposed a numerical algorithm for obtaining the solution. In particular, the first point estimate inline image is obtained by solving

  • image(13)

and the others recursively by

  • image(14)

A familiar special case occurs when the error term follows the extreme value distribution. In this case, because the linear transformation model is equivalent to the Cox’s proportional hazards model, we can also use the Breslow estimator (Breslow, 1972) to approximate the non-parametric transformation. The Breslow estimator aims to estimate the baseline cumulative hazard (inline image) in the proportional hazards model, and by virtue of the one-to-one connection between the transformation (H(t)) and the baseline hazard (inline image), we can estimate H(t) directly from the Breslow estimator, and the results will be comparable to the estimate calculated from (14).
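The Breslow route can be illustrated with a minimal sketch. It assumes no censoring, no tied response times, and a sign convention H(T) = −βτ + ɛ with ɛ following the (minimum) extreme value distribution, so that the baseline cumulative hazard is exp{H(t)} and H(t) is its logarithm. The function name `breslow_log_cumhaz` and the argument `lin_pred` (holding βτi for each examinee) are hypothetical and are not taken from the paper.

```python
import math

def breslow_log_cumhaz(times, lin_pred):
    """Return sorted times and H_hat(t) = log of the Breslow cumulative hazard.

    times    : observed response times (assumed uncensored and untied)
    lin_pred : beta * tau_i for each examinee (assumed sign convention)
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    weights = [math.exp(lin_pred[i]) for i in order]
    # Sum of exp(lin_pred) over examinees who have not yet responded.
    risk_sum = sum(weights)
    cumhaz = 0.0
    sorted_times, h_hat = [], []
    for k, i in enumerate(order):
        cumhaz += 1.0 / risk_sum          # Breslow increment at this time
        sorted_times.append(times[i])
        h_hat.append(math.log(cumhaz))    # H(t) = log(baseline cumulative hazard)
        risk_sum -= weights[k]            # examinee leaves the risk set
    return sorted_times, h_hat
```

With all linear predictors equal to zero, the increments reduce to 1/N, 1/(N − 1), …, 1, which is the Nelson–Aalen estimator, so the step function is non-decreasing by construction.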

4.3. Parameter estimation

4.3.1. Prior specification

A bivariate normal prior is chosen for the latent parameters (θ, τ), that is, inline image, where inline image and inline image. A normal prior is chosen for each regression parameter βj with means equal to 0 and variance chosen to be 10. Here we deliberately selected a large variance to make the prior less informative. The correlation term ρθτ is also chosen to have a vague normal prior as in Klein Entink et al. (2009). But we restricted the prior to be within the range [−1, 1], that is, we employed a truncated normal prior inline image. This treatment of restricting the prior region of the covariance (in our case the correlation) parameter to the area that supports a positive definite covariance is discussed in Mulder and Fox (2011). For the variance term, we impose an inverse-gamma prior, that is, inline image. The gamma prior was chosen in Klein Entink et al. (2009) because it is a conjugate prior for the normal distribution with known mean; although the inverse-gamma is no longer a conjugate prior in our case with a logistic response model, we still adopt this prior.

4.3.2. Markov chain Monte Carlo

As mentioned before, we assume a setting in which item parameters are previously calibrated and are taken as known. This will be the case in item banks for which response times are recorded, item IRT parameters have been calibrated, but no RT model has been calibrated. However, the estimation method introduced above can still be used when the item parameters in the three-parameter logistic model are unknown. If that happens, one just has to add three additional chains for estimating the a, b, and c parameters (Patz & Junker, 1999) separately. The reason why we assume them to be known is that we want to emphasize the estimation of the linear transformation model parameters, which are the focus and innovation of the present paper. During the estimation, we need to sequentially draw parameters inline image (or ρθτ), θ, τ and β. The details of the Metropolis–Hastings within Gibbs sampler are presented below.

Suppose the items are indexed by j = 1, …, J, and the examinees by i = 1, …, N. Denote the responses and response times of the ith test taker by inline image, and inline image, respectively. To perform the sampling for parameters with support on the entire real line, we use normal proposal distributions with mean equal to the current estimate and variance chosen to give a Metropolis acceptance rate of between 25 and 50 per cent. For parameters with support not on the real line, we first transform them to the real line and then sample them from a normal proposal distribution.
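The transform-then-propose idea can be sketched for a single positive parameter such as a variance. The snippet below is only an illustration under stated assumptions: it uses a log transformation, a hypothetical log-density `log_post`, and a toy Gamma(3, 1) target instead of the model's actual posterior. The term log(proposal) − log(current) is the log-Jacobian correction for proposing on the log scale.

```python
import math
import random

def rw_metropolis_positive(log_post, current, step_sd, rng):
    """One random-walk Metropolis update for a parameter restricted to (0, inf)."""
    phi_new = math.log(current) + rng.gauss(0.0, step_sd)  # propose on the log scale
    proposal = math.exp(phi_new)
    # Target ratio times the Jacobian of the log transformation.
    log_ratio = (log_post(proposal) + math.log(proposal)
                 - log_post(current) - math.log(current))
    if math.log(1.0 - rng.random()) < log_ratio:
        return proposal, True
    return current, False

def run_chain(n_iter, step_sd=1.0, seed=0):
    """Toy example: sample from Gamma(shape=3, rate=1), whose mean is 3."""
    rng = random.Random(seed)
    log_post = lambda v: 2.0 * math.log(v) - v  # unnormalized Gamma(3, 1) log-density
    value, accepted, draws = 1.0, 0, []
    for _ in range(n_iter):
        value, ok = rw_metropolis_positive(log_post, value, step_sd, rng)
        accepted += ok
        draws.append(value)
    return draws, accepted / n_iter
```

In practice `step_sd` would be tuned until the returned acceptance rate falls in the 25 to 50 per cent range mentioned above.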

Step 1. Denote the initial values for inline image, inline image, and inline image by inline image, inline image and inline image, respectively. inline image is the maximum likelihood estimator (MLE), maximizing the likelihood function formed by equation (4). The initial value inline image is calculated by the sample variance of inline image. The initial value of inline image is obtained differently depending upon the specific test conditions. The details are given in Wang et al. (under review). Conditioning on inline image, inline image is obtained by maximizing the approximation of the rank-based likelihood in (11). The initial value of inline image is the sample mean of the θs. Set the iteration counter iter = 1.

Step 2. At the rth step, denote the previous positions by inline image, inline image, inline image, inline image and inline image, inline image. Sample each parameter sequentially as follows.

  • • 
    inline image (variance of inline image). Because the variance is always non-negative, we draw inline image from inline image with acceptance probability
    • image(15)
    where
    • image
    with inline image, inline image and variance–covariance matrix
    • image
    inline image is the inverse-gamma prior for the variance term.
  • • 
    ρθτ (correlation between inline image and inline image). Because −1 ≤ρθτ≤ 1, we need to first transform ρθτ to the real line; the transformation we adopt is inline image. Then draw ϕ* from inline image, and the acceptance probability of the corresponding inline image is
    • image(16)
    where inline image is the truncated normal prior density of the correlation term, with inline image being the one-to-one transformation of ϕ*. The Jacobian matrix inline image is involved due to the transformation of inline image, and one will notice that it is the same as the transition matrix as if drawing inline image from a non-symmetric distribution (instead of drawing ϕ* from a symmetric normal distribution).
  • • 
    μθ (population mean of θ). Draw inline image from inline image with acceptance probability
    • image(17)
    where inline image is again a product of multivariate normal densities, and inline image is the normal prior density.
  • • 
    θ and τ (examinees’ ability and speed). For the ith pair (θi, τi), 1 ≤ i ≤ N, draw inline image from a bivariate normal distribution with mean inline image and variance–covariance matrix
    • image
    The acceptance probability is now
    • image(18)
    where inline image. inline image is a bivariate normal with mean inline image and variance–covariance matrix inline image. IRT(·) is calculated from equation (4), and L(·) is defined by (11).
  • • 
    β (survival regression parameter). For the jth item, draw inline image from a normal distribution inline image with the acceptance probability defined as
    • image(19)
    where L(·) is defined in equation (11).

Step 3. Increase the iteration counter from r to r + 1 and return to step 2 until r = M, where M is a pre-specified number.

Step 4. At the end of the chain, compute the posterior mean of each parameter. A burn-in period of the initial K iterations is often required to allow the chain to reach equilibrium. Once the parameters are well estimated, we move on to the second stage of estimating the non-parametric monotone transformation.
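The overall flow of steps 2 to 4 (sequential Metropolis updates, a burn-in period, then posterior means) can be sketched on a toy problem. This is not the sampler for the model itself: the target below is an assumed standard bivariate normal with correlation 0.5, and the function names are hypothetical.

```python
import math
import random

def log_target(x, y, rho=0.5):
    # Log-density (up to a constant) of a standard bivariate normal with correlation rho.
    return -(x * x - 2.0 * rho * x * y + y * y) / (2.0 * (1.0 - rho * rho))

def mh_within_gibbs(n_iter, burn_in, step_sd=1.0, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    kept = []
    for r in range(n_iter):
        # Update x given the current y with a random-walk Metropolis step.
        x_new = x + rng.gauss(0.0, step_sd)
        if math.log(1.0 - rng.random()) < log_target(x_new, y) - log_target(x, y):
            x = x_new
        # Update y given the freshly updated x.
        y_new = y + rng.gauss(0.0, step_sd)
        if math.log(1.0 - rng.random()) < log_target(x, y_new) - log_target(x, y):
            y = y_new
        # Keep only the draws after the burn-in period.
        if r >= burn_in:
            kept.append((x, y))
    mean_x = sum(p[0] for p in kept) / len(kept)
    mean_y = sum(p[1] for p in kept) / len(kept)
    return mean_x, mean_y, len(kept)
```

Each coordinate plays the role of one parameter block; in the actual algorithm the blocks would be the variance or correlation term, μθ, the (θi, τi) pairs, and the βj.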

When the data come from a cognitive task, it is often the case that the items (also known as ‘trials’ in cognitive experiments) are very similar to each other within a block. In this regard, we can treat the items belonging to the same class as having identical item parameters. There are two possible ways to approach this issue. We can either view those items as single items and aggregate examinees’ responses/RTs to those items, or we can treat them as conditionally independent given the examinees’ latent speed and ability but impose an equality constraint (i.e., we update those same item parameters together in a single chain while pooling information from responses and RTs for all those items to construct the acceptance ratio). Either way, the method described above can be applied in a straightforward fashion.

4.4. Model diagnosis

To check global fit, the deviance information criterion (DIC; Spiegelhalter, Best, Carlin, & van der Linde, 2002), which is analogous to the Akaike information criterion but designed specifically for Bayesian hierarchical models, may be used. A lower DIC usually indicates better fit. The DIC is equal to the deviance plus a penalty term for model complexity. The deviance is calculated as

  • image

where fɛ is the density of the error term distribution and Pj(θi) is defined in equation (4). The DIC is given by

  • image

where inline image, in which M is the number of iterations of the algorithm, and inline image.
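As a small worked sketch, the DIC can be computed from the deviance values saved along the chain. The code assumes the standard definition in Spiegelhalter et al. (2002), namely the posterior mean deviance plus a penalty equal to that mean minus the deviance evaluated at the posterior mean of the parameters. The function name and arguments are hypothetical.

```python
def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, p_D) from the deviance at each retained MCMC iteration."""
    d_bar = sum(deviance_draws) / len(deviance_draws)  # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean           # effective number of parameters
    return d_bar + p_d, p_d
```

For example, deviance draws of 10, 12, and 14 with a deviance of 11 at the posterior mean give a mean deviance of 12, a penalty of 1, and a DIC of 13.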

We also use the Kullback–Leibler (KL) distance as a global fit index. The KL distance measures the divergence between two probability density functions. In the current setting, for item j, we obtain a set of estimated error terms inline image, for i= 1, …, N, from which we can estimate its density by kernel smoothing as

  • image

According to the model, the true errors ɛij will follow some theoretical distribution f0j(ɛij). Therefore, we can calculate the KL distance between the empirical and theoretical distribution of ɛij as

  • image

This KL distance is averaged over all items to obtain an averaged KL distance, which quantifies the overall fit of the model. A smaller KL distance indicates a better fit. Considering the MCMC model estimation method, the averaged KL distance can be calculated at each point of the chain such that we can obtain the whole distribution of averaged KL distances. At an item level, we can check the fit graphically by comparing the empirical and theoretical distributions of the error terms.
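A rough numerical sketch of the item-level KL computation follows. Several choices here are assumptions rather than the paper's specification: a Gaussian kernel, a rule-of-thumb bandwidth of 1.06·sd·n^(−1/5), the direction KL(empirical ‖ theoretical), and midpoint-rule integration on a fixed grid.

```python
import math
import random

def gaussian_kde(sample, bandwidth):
    """Return a function evaluating a Gaussian kernel density estimate."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
    return density

def kl_distance(sample, theoretical_pdf, lo, hi, n_grid=400):
    """Approximate KL(kernel estimate || theoretical density) on [lo, hi]."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in sample) / (n - 1))
    kde = gaussian_kde(sample, 1.06 * sd * n ** (-0.2))  # assumed bandwidth rule
    step = (hi - lo) / n_grid
    total = 0.0
    for k in range(n_grid):
        x = lo + (k + 0.5) * step  # midpoint of the k-th grid cell
        f_hat, f0 = kde(x), theoretical_pdf(x)
        if f_hat > 0.0 and f0 > 0.0:
            total += f_hat * math.log(f_hat / f0) * step
    return total

def normal_pdf(x, mean=0.0):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)
```

Applied to estimated errors from one item, a value near zero suggests the assumed error distribution is plausible, whereas a clearly misspecified density yields a much larger value.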

5. Simulation study

Simulation studies were carried out to check the performance of the proposed MCMC estimation method as well as the recursive method for estimating the non-parametric transformation. As a starting point, we only consider the non-adaptive situation, in which each examinee takes the same set of items. The test length was set at 20, and the examinee sample size was set at 200. We chose such small values to demonstrate that even with short test lengths and small sample sizes, the estimation can still be accurate. This situation is also seen in adaptive designs, in which each item is measured by a certain group of examinees rather than the whole sample, and thus the examinee sample size will not be very large.

A total of 3 × 4 = 12 different conditions were simulated. The first factor represents the distribution of the error term, which is either extreme value, normal, or logistic. The second factor indicates the different monotone transformations. The first transformation was Hj(t) = log (λjt), with λ∼U(0.25, 1.5) as an item level parameter. The second was a Box–Cox transformation, inline image, with λ∼U(0.5, 1.5). To ensure H(t) was on the real line, we subtracted 5 from the transformation. However, some other arbitrary value could be chosen. When λ < 1, it is a monotone concave function, and when λ > 1, it is a monotone convex function. The third transformation, Hj(t) = λj(t − 5)^3 with λ∼U(0.5, 2), has an inflection point in the middle. Again, 5 is chosen because it is the median of the response times, but other arbitrary values can also be chosen. The fourth transformation was a mixture of the previous three, with 7, 7, and 6 items belonging to log, Box–Cox, and inflection transformations. These transformations and corresponding parameters were chosen to produce realistic response time distributions.
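To make the data-generating step concrete, here is a minimal sketch for the log transformation only. It rests on assumptions that are not stated in this section: the sign convention Hj(Tij) = −βjτi + ɛij, slopes βj fixed at 1, and speeds τi drawn from a standard normal distribution. All function names are hypothetical.

```python
import math
import random

def open_uniform(rng):
    """Uniform draw on the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u

def draw_error(kind, rng):
    """Draw one error term from a standard normal, logistic, or extreme value law."""
    if kind == "normal":
        return rng.gauss(0.0, 1.0)
    u = open_uniform(rng)
    if kind == "logistic":
        return math.log(u / (1.0 - u))       # inverse logistic CDF
    if kind == "extreme_value":
        return math.log(-math.log(1.0 - u))  # inverse CDF of 1 - exp(-exp(x))
    raise ValueError(kind)

def rt_from_log_transform(lam, beta, tau, eps):
    # Invert H(t) = log(lam * t) = -beta * tau + eps (assumed sign convention).
    return math.exp(eps - beta * tau) / lam

def simulate_rts(n_examinees, n_items, kind, seed=0):
    rng = random.Random(seed)
    lams = [rng.uniform(0.25, 1.5) for _ in range(n_items)]   # item-level lambda
    betas = [1.0] * n_items                                   # hypothetical slopes
    taus = [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]  # latent speeds
    return [[rt_from_log_transform(lams[j], betas[j], taus[i], draw_error(kind, rng))
             for j in range(n_items)]
            for i in range(n_examinees)]
```

Calling `simulate_rts(200, 20, "extreme_value")` mirrors the sample size and test length used in the study and returns strictly positive response times.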

An illustration of the possible RT distributions is given in Figure 1, with each curve representing the shape of the histogram of the RT distributions. The curves were obtained by averaging over 100 replications. As one might observe, all the transformations had supports on the positive real line, and the log transformation produced very skewed distributions. This type of RT distribution is seen when the items are very easy so that most examinees answer the items within a very short time, and only examinees with extremely low abilities (or low speed) tend to take a long time to finish. The Box–Cox transformation yielded either skewed or near symmetric RT distributions, depending upon the value of the power transformation parameter. The transformation with an inflection point yielded a bimodal distribution, and this distribution often indicates that examinees engage in two different strategies when answering the items. The examinees’ latent traits (θ, τ) were generated from a bivariate normal distribution with mean (0, 0) and covariance matrix inline image.

Figure 1. The possible RT distributions from different combinations of error term distribution and transformations.

Bias and mean squared error (MSE) were calculated to evaluate the closeness of the estimated parameters to their true values. For the population parameters μθ, σθτ and inline image, only bias was calculated. Tables 1 and 2 present the average bias and MSE of θ, with and without the RT information, under the 12 simulation conditions. Both average bias and MSE were obtained over all examinees and all replications within a simulation condition. To show that the estimation of θ is more accurate when RT is used as collateral information, we present the MSE calculated from both the initial inline image (MLE of θ estimated from responses only) and the final estimate of inline image. The recovery of the non-parametric transformation was evaluated by the standardized version of the integrated absolute difference between the true Hj(t) and its estimate inline image for the jth item,

  • image(20)

where the integration is done over the possible RTs for item j. The denominator is added to remove the scale differences inherent in different transformations. The mean of δ(H)j is also reported in Table 2.

Table 1. Bias of the parameters.

Error distribution    Transformation    θ (IRT)    θ (IRT+RT)    β         τ         σθτ       inline image    μθ
Normal error          log               −0.031     −0.030        −0.029    0.051     0.098     0.331           −0.076
                      Box–Cox           0.018      0.009         0.037     0.019     0.059     0.297           0.088
                      Inflection        0.035      −0.030        0.056     0.059     0.091     0.298           0.081
                      Mixed             0.029      0.021         0.081     0.063     0.044     0.341           0.006
Logistic error        log               −0.017     −0.008        0.091     −0.018    0.043     0.271           0.007
                      Box–Cox           −0.056     −0.045        −0.020    0.017     0.041     0.351           −0.006
                      Inflection        0.059      0.037         −0.018    0.021     0.005     0.231           0.008
                      Mixed             0.037      0.029         0.036     0.051     0.042     0.281           0.018
Extreme value error   log               0.005      −0.007        0.049     0.019     0.041     0.259           0.019
                      Box–Cox           0.039      0.031         0.041     0.042     0.051     0.301           0.018
                      Inflection        0.051      −0.029        0.039     −0.047    −0.019    0.244           0.015
                      Mixed             0.003      0.005         0.047     −0.019    −0.029    0.237           0.109

Table 2. MSE for the parameters and absolute difference between non-parametric transformations.

Error distribution    Transformation    θ (IRT)    θ (IRT+RT)    β        τ        δ(H)
Normal error          log               0.141      0.111         0.012    0.043    1.761
                      Box–Cox           0.161      0.126         0.008    0.052    0.812
                      Inflection        0.153      0.125         0.026    0.042    0.479
                      Mixed             0.161      0.123         0.022    0.045    0.638
Logistic error        log               0.169      0.138         0.078    0.147    1.623
                      Box–Cox           0.164      0.109         0.027    0.153    0.809
                      Inflection        0.161      0.131         0.029    0.159    0.244
                      Mixed             0.157      0.106         0.046    0.118    0.471
Extreme value error   log               0.169      0.118         0.041    0.059    1.701
                      Box–Cox           0.168      0.121         0.031    0.062    0.712
                      Inflection        0.165      0.121         0.023    0.043    0.574
                      Mixed             0.157      0.116         0.021    0.047    0.581

As shown in Table 2, with normal errors and extreme value errors, the examinees’ speed parameters τ were recovered very accurately, with extremely small MSE. However, when the error term followed the logistic distribution, the MSE of τ was much larger because the approximation to the rank-based likelihood (as in equation (10)) is less accurate. The bias results display similar patterns. All the bias values are acceptably small except for the bias of β in the logistic error model, and this is also due to the less than ideal approximation. The MSE of θ decreased when the response time information was considered, and this indicates that the response times provide useful collateral information to locate examinees’ true abilities. The MSE of β was uniformly small with different error distributions. The various shapes of the monotone transformations did not affect the estimation results either. The parameters σθτ and μθ were recovered accurately with small bias across different conditions, whereas inline image was often estimated with large positive bias. Considering that the ability variance will not affect our interpretations of the RT information as well as its relationships with responses, the results are still acceptable. The non-parametric transformation was also recovered accurately, as shown by the small standardized integrated differences. Only the log transformation showed slightly larger differences between the true and estimated transformations, and this is because the log transformation has a long and nearly flat tail that is relatively hard to capture, especially bearing in mind that only a few examinees will have extremely long RTs (see Figure 1). To further show that the unknown transformation can be accurately recovered, we present the true transformation versus estimated transformation for the normal error model in Figure 2.

Figure 2. True versus estimated non-parametric transformation for normal error model.

6. Real data application

The linear transformation models were applied to a data set from a large-scale, high-stakes computerized adaptive test. The data set consisted of 21,444 examinees and 620 multiple-choice items in total. The default test length is 37, but the number of items that each examinee answered ranged from 25 to 37. Because the test was adaptive, each item was answered by a different set of examinees, and the number of examinees taking each item ranged from 6 to 489. We randomly sampled 3000 examinees from this population for analysis. However, we deleted 319 examinees because their RTs were not recorded; we further deleted 548 examinees either because their total RTs were too long (longer than 75 minutes) or because they failed to finish the whole test (i.e., they answered fewer than 37 items). Tests longer than 75 minutes occurred because some examinees took the test under non-standard accommodation settings. The resulting 2061 observations were used in the analysis. The original RTs were recorded in milliseconds, and for ease of calculation we converted the RTs to minutes by dividing each RT record by 60,000. The posterior distributions for the item and person parameters were approximated using the MCMC algorithm described above. Non-informative priors were chosen for item parameter β and population parameters μθ, inline image and σθτ. The bivariate normal distribution with mean μ = (μθ, 0) and covariance matrix inline image was chosen as the prior for (θ, τ) for each examinee.

Three linear transformation models with different error distributions were fitted to the data, and all of them converged successfully. Summary statistics for the model parameters are given in Table 3. The means and standard deviations (SD) of examinees’ ability estimates are very close across the models. The other parameters, however, differ quite a bit in terms of mean and standard deviation. The standard deviation of β for the logistic error model is very large compared to the other two models because we found 9 out of 620 items with extremely large β values (β > 6). The correlations between examinees’ latent ability and speed are all larger than .4, and when the error terms follow extreme value distributions the correlation is as high as .594. This result indicates that examinees with higher ability tend to answer the items faster, which is consistent with common sense. The DIC values for the three models are presented in Table 4, and Figure 3 displays the boxplots of the KL distance measure for each model. The KL distance is calculated on the last 1000 iterations, taking the first 3000 iterations as burn-in. The results show that both DIC and KL distance measures favour the proportional hazards model, followed by the normal error model, whereas the logistic error model shows the poorest fit. Even so, when evaluating item-level fit, we found that the proportional hazards model might not always be a best fit. For instance, Figure 4 presents one particular item that is best fitted with the logistic error model.

Table 3. Summary statistics for three linear transformation models.

Model                  Statistic    θ        τ         β         ρθτ      inline image    μθ
Normal error           Mean         0.639    −0.008    −0.005    0.548    1.766           0.646
                       SD           1.134    0.713     0.572
Logistic error         Mean         0.635    0.008     0.591     0.417    1.878           0.625
                       SD           1.149    0.788     1.964
Extreme value error    Mean         0.645    0.003     −0.109    0.594    1.653           0.646
                       SD           1.111    0.802     0.798

Table 4. Summary statistics for three linear transformation models.

Model                        DIC
Normal error model           −2.725 × 10^7
Logistic error model         −1.025 × 10^7
Extreme value error model    −2.745 × 10^7

Figure 3. Boxplots of the KL distance measure for the three models.

Figure 4. Item fit analysis.

7. Discussion

Response times on test items are easily collected in modern computerized testing. Analysing response times provides useful collateral information to further understand examinees’ behaviours and item/test characteristics. Many parametric latent trait models have been proposed in the past to model RTs exclusively or with responses simultaneously. We have proposed a semi-parametric model that is able to represent a variety of RT distributions. In particular, the linear transformation with frailty model proposed in the current paper is a generalization of the lognormal model (van der Linden, 2007), the Box–Cox normal model (Klein Entink et al., 2009b), the Weibull model (Rouder et al., 2003), and the Cox proportional hazards model (Wang et al., 2012).

Although the semi-parametric nature of the linear transformation model creates the flexibility to fit the data well, one challenge lies in model estimation. The proposed method uses the marginal likelihood of ranks coupled with the Metropolis–Hastings within Gibbs sampler for the estimation of the parameters in the model, and then uses the recursive algorithm (Chen et al., 2002) for estimating H. This two-stage estimation method is able to recover the true model parameters and unknown transformation very well. One limitation of the current estimation method is that we need to know the parametric form of the error term distribution beforehand. With different error distributions, the approximation to the rank-based likelihood changes substantially. However, based on the real data example given in the paper, a fixed error distribution is sometimes too restrictive, and it is often possible that some items are better fitted with a normal error model whereas others are more consistent with a logistic error model. If that happens, we need to employ an even more flexible model with error distributions unspecified. This generalization will certainly introduce extra difficulty in estimation. Mallick and Walker (2003) provided a fully Bayesian estimation method for such a generalized model with observed covariates, and they employed a Pólya tree distribution as a prior for the unknown distribution Fɛ. Their method might be a promising starting point for extending the current model with additional flexibility.

Despite the flexibility this model affords, real online testing faces considerable challenges, resulting in measurement errors associated with the combination of hardware, software, and network connections. Therefore, the response time patterns can vary substantially across and within testing sessions, indicating that even more flexible models might be required. In particular, such real challenges indicate substantial heterogeneity in the variance of recorded times, even within subjects. As a consequence, it becomes hard to summarize any two observations, or any two sets of observations, in the same parametric model. The semi-parametric model proposed in the current paper allows for some random error through the error term ɛij for every item–examinee pair. But for ease of model estimation, we restricted the error term to follow a specific parametric distribution for every item. In the future, we can allow the error term either to follow different parametric distributions for different items, or to be completely unspecified and estimate it non-parametrically. This generalization will be able to represent a larger range of random errors.

A recent cutting-edge technology for online testing is browser/server (B/S) technology that uses commonly available web-browsing software on the client side and a simple server that can be fitted to a regular PC or laptop. The statistical computing (such as intermediate maximum likelihood estimation and item selection) is done by the combined computing power of the individual central processing units, so that the effect of the internet speed is reduced. Thanks to advances in web technology, delivering online testing becomes easier and more reliable (Chang, 2012). If something related to the hardware or internet connection happens unexpectedly (e.g., the internet disconnects for quite some time during the test), this will introduce a systematic error to the model rather than a random error, which could certainly have a negative impact on item and person parameter estimates. However, new technology promises to keep this to a minimum, and systematic errors can further be addressed by residual methods designed to detect outliers occurring in such a manner. The methods for diagnosing test speededness (van der Linden et al., 2007) could be used for this purpose in a similar fashion.

When constructing a useful model for a real data set, it may come down to a trade-off between model identifiability, flexibility, and interpretability. ‘All models are wrong, but some are useful’, as the statistician G.E.P. Box said, and we need to make sure not to overfit or underfit a data set. In this regard, model diagnosis is critical. We suggested using both the DIC and Kullback–Leibler divergence to evaluate the fit of a hierarchical linear transformation model. More local aspects of model fit may be accomplished through posterior predictive checks (Sinharay & Johnson, 2003; Sinharay, Johnson, & Stern, 2006). These global and local techniques, taken with a variety of flexible joint models to choose from, promise to improve the efficiency of testing.

Acknowledgements

This work was funded by a grant from the National Science Foundation, NSF-MMS 0960822. Part of the paper was originally presented at the 2011 annual meeting of the Psychometric Society, Hong Kong. The authors are indebted to the associate editor and three anonymous reviewers for their suggestions and comments on the earlier manuscript.

References
