Integrating Cognitive Process and Descriptive Models of Attitudes and Preferences

Abstract

Discrete choice experiments—selecting the best and/or worst from a set of options—are increasingly used to provide more efficient and valid measurement of attitudes or preferences than conventional methods such as Likert scales. Discrete choice data have traditionally been analyzed with random utility models that have good measurement properties but provide limited insight into cognitive processes. We extend a well-established cognitive model, which has successfully explained both choices and response times for simple decision tasks, to complex, multi-attribute discrete choice data. The fits, and parameters, of the extended model for two sets of choice data (involving patient preferences for dermatology appointments, and consumer attitudes toward mobile phones) agree with those of standard choice models. The extended model also accounts for choice and response time data in a perceptual judgment task designed in a manner analogous to best–worst discrete choice experiments. We conclude that several research fields might benefit from discrete choice experiments, and that the particular accumulator-based models of decision making used in response time research can also provide process-level instantiations for random utility models.

1. Introduction

Many fields rely on the elicitation of preferences. However, direct questioning methods, such as Likert scales, suffer from well-established drawbacks due to subjectivity (for a summary see Paulhus, 1991). Discrete choice—for example, choosing a single preferred product from a set of presented options—provides more reliable and valid measurement of preference in areas such as health care (Ryan & Farrar, 2000; Szeinbach, Barnes, McGhan, Murawski, & Corey, 1999), personality measurement (Lee, Soutar, & Louviere, 2008), and marketing (Mueller, Lockshin, & Louviere, 2010). More efficient and richer discrete-choice elicitation is provided by best–worst scaling, where respondents select both the best option and worst option from a set of alternatives. For example, a respondent presented with six bottles of wine might be asked to report her most and least preferred bottles. Data collection using best–worst scaling has been increasingly used, particularly in studying consumer preference for goods or services (Collins & Rose, 2011; Flynn, Louviere, Peters, & Coast, 2007, 2008; Lee et al., 2008; Louviere & Flynn, 2010; Louviere & Islam, 2008; Marley & Pihlens, 2012; Szeinbach et al., 1999).

In applied fields, best–worst data are often analyzed using conditional logit (also called multinomial logit, MNL) models; the basic model is also known in cognitive science as the Luce choice model (Luce, 1959). These models assume that each option has a utility (u, also called “valence” or “preference strength”) and that choice probabilities are simple (logit) functions of those utilities (Finn & Louviere, 1992; Marley & Pihlens, 2012).1 MNL models provide compact descriptions of data and can be interpreted in terms of (random) utility maximization but afford limited insight into the cognitive processes underpinning the choices made. MNL models also do not address choice response time,2 a measure that has become easy to obtain since data collection was computerized.
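To make the MNL choice rule concrete, the following R sketch (R being the language used for the supplementary code described later in this article) computes choice probabilities from a set of purely hypothetical utilities:

# Minimal sketch of the MNL (Luce) choice rule: choice probabilities are a
# logit (softmax) function of option utilities. Utility values are hypothetical.
mnl_choice_prob <- function(u) exp(u) / sum(exp(u))
round(mnl_choice_prob(c(wineA = 0.9, wineB = 0.2, wineC = -0.5)), 3)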

We explore the application of modern, comprehensive evidence accumulation models—typically employed to explain simple perceptual choice tasks—to the complex decisions involved in best–worst choice between multi-attribute options. Our application bridges a divide between relatively independent advances in theoretical cognitive science (computational models of cognitions underlying simple decisions) and applied psychology (best–worst scaling to elicit maximal preference information from respondents). The result is the best of both worlds: a more detailed understanding of the cognitions underlying preference but without loss in the statistical properties of measurement and estimation.

As summarized in the next section, previous work in this direction has been hampered by computational and statistical limitations. We show that these issues can be overcome using the recently developed linear ballistic accumulator (LBA: Brown & Heathcote, 2008) model. We do so by applying mathematically tractable LBA-based models to two best–worst scaling data sets: one involving patient preferences for dermatology appointments (Coast et al., 2006), and another involving preference for aspects of mobile phones (Marley & Pihlens, 2012). In these applications, chosen to demonstrate the applicability of our methodology to diverse fields and measurement tasks, we show that previously published MNL utility estimates are almost exactly linearly related to the logarithms of the estimated rates of evidence accumulation in the LBA model; this is the relation that might be expected from the role of the corresponding measures in the two types of models. We follow this demonstration with an application to a perceptual judgment task that uses the best–worst response procedure, with precise response time measurements, to demonstrate the benefit of response time information in understanding the extra decision processes involved in best–worst choice.

In the first section of this article, we describe evidence accumulation models, and the LBA model in particular. We then develop three LBA-based models for best–worst choice that are motivated by assumptions paralleling those previously used in corresponding random utility models of best–worst choice. We show that those LBA models describe the best–worst choice probabilities at least as well as the random utility models. However, in the second section, we show that the three earlier LBA models are falsified by the response time data from the best–worst perceptual task. We then modify one of those LBA models to account for all features of the response time, and hence the choice, data. We conclude that response time data further our understanding of the decision processes in best–worst choice tasks, and that the LBA models remedy a problem with the random utility models, by providing a plausible cognitive interpretation.

2. Accumulator models for preference

Models of simple decisions as a process of accumulating evidence in favor of each response option have over half a century of success in accounting not only for the choices made but also the time taken to make them (for reviews, see Luce, 1986; Ratcliff & Smith, 2004). When there are only two response choices, these models sometimes have only a single evidence accumulation process (e.g., Ratcliff, 1978), but if there are more options then it is usual to assume a corresponding number of evidence accumulators that race to trigger a decision (e.g., van Zandt, Colonius, & Proctor, 2000). Multiple accumulator models have provided comprehensive accounts of behavior when deciding which one of several possible sensory stimuli has been presented (e.g., Brown, Marley, Donkin, & Heathcote, 2008) and even accounted for the neurophysiology of rapid decisions (e.g., Forstmann et al., 2008; Frank, Scheres, & Sherman, 2007). Accumulator models have also been successfully applied to consumer preference, most notably decision field theory (Busemeyer & Townsend, 1992; Roe, Busemeyer, & Townsend, 2001), the leaky competing accumulator model (Usher & McClelland, 2004), and, most recently, the 2N-ary choice tree model (Wollschläger & Diederich, 2012).

Accumulator models provide detailed mechanistic accounts of the processes underlying decisions—accounts that are lacking in random utility models. Nevertheless, the cognitive models mentioned above have practical disadvantages that do not apply to classic random utility models of choice. The most important practical problem is that those cognitive models do not have closed form expressions for the joint likelihood of response choices and response times. When only response choices are considered, some of these models do have such solutions, but when response times are included as well, the likelihood functions have to be approximated either by Monte-Carlo simulation or by discretization of evidence and time (Diederich & Busemeyer, 2003). The Monte–Carlo methods make likelihoods very difficult to estimate accurately, and the discretization approach can prove impractical when there are more than two options in the choice set. A newer cognitive model (Otter, Allenby & van Zandt, 2008; Ruan, MacEachern, Otter & Dean, 2008) has managed to alleviate some of the practical problems of computation, but it is also limited by its underlying Poisson accumulator model, which—in contrast to the LBA model (see Brown & Heathcote, 2008)—has been shown to provide an incomplete account of standard perceptual decision data (see Ratcliff & Smith, 2004).

Like other multiple accumulator models, the original LBA is based on the idea that the decision maker accumulates evidence in favor of each choice and makes a decision as soon as the evidence for any choice reaches a threshold amount. This simple “horse race” architecture makes the model simple to analyze and use but does not naturally explain subtle preference reversals and context effects—a point we return to below. For the LBA model, the time to accumulate evidence to threshold is the predicted decision time, and the response time is the decision time plus a fixed offset (t0), which accounts for processes such as response production. Fig. 1 gives an example of an LBA decision between options A and B, represented by separate accumulators that race against each other. The vertical axes represent the amount of accumulated evidence, and the horizontal axes the passage of time. Response thresholds (b) are shown as dashed lines in each accumulator, indicating the quantity of evidence required to make a choice. The amount of evidence in each accumulator at the beginning of a decision (the “start point”) varies independently between accumulators and randomly from choice to choice, sampled from a uniform distribution: U(0, A), with A ≤ b. Evidence accumulation is linear, as illustrated by the arrows in each accumulator of Fig. 1. The speed of accumulation is traditionally referred to as the “drift rate,” and this is assumed to vary randomly from accumulator to accumulator and decision to decision according to an independent normal distribution for each accumulator, reflecting choice-to-choice changes in factors such as attention and motivation.

Figure 1.

Illustrative example of the decision processes of the original linear ballistic accumulator (LBA).

Mean drift rate reflects the attractiveness of an option: A higher value gives a faster rise to threshold and therefore a more likely choice outcome. For example, in Fig. 1 suppose option A has a mean drift rate of 0.6 (dA = 0.6) and option B a mean drift rate of 0.4 (dB = 0.4). When options A and B are presented together, a choice of response A is more likely on average since option A has a larger drift rate than option B, and so will usually reach threshold first. However, since there is noise in the decision process (in both start point and drift rate), the accumulator for response B will occasionally reach threshold first, leading to a choice of option B. Noise in the decision process allows the LBA to account for the observed variability in decision making, successfully predicting the joint distribution of response times and response choices across a wide range of tasks (e.g., Brown & Heathcote, 2008; Forstmann et al., 2008; Ho, Brown, & Serences, 2009; Ludwig, Farrell, Ellis, & Gilchrist, 2009). Here, we employ a slightly modified LBA where drift rates are drawn from strictly positive truncated normal distributions (for details, see Heathcote & Love, 2012).3
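The following R sketch simulates this race for the two-accumulator example in Fig. 1; all parameter values (A, b, s, t0) are illustrative only and are not estimates from data:

# Simulate one decision from the (truncated-normal) LBA sketched in Fig. 1.
set.seed(1)
rtnorm_pos <- function(n, mean, sd)            # normal truncated to positive values
  qnorm(runif(n, pnorm(0, mean, sd), 1), mean, sd)

simulate_lba_trial <- function(d, A = 0.5, b = 1, s = 0.25, t0 = 0.2) {
  start  <- runif(length(d), 0, A)             # independent uniform start points
  drift  <- rtnorm_pos(length(d), d, s)        # one drift sample per accumulator
  finish <- t0 + (b - start) / drift           # linear rise to threshold
  list(choice = unname(which.min(finish)), rt = min(finish))
}

# With dA = 0.6 and dB = 0.4, option A should win most, but not all, races
choices <- replicate(2000, simulate_lba_trial(c(0.6, 0.4))$choice)
mean(choices == 1)                             # proportion of simulated choices of option A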

3. Horse race models for best–worst choice

We use the modified LBA to create three different models for best–worst choice, derived from previously applied random utility models of choice (e.g., Marley & Louviere, 2005). Each variant involves a race among accumulators representing the competing choices. In the first, which we refer to as the ranking model, there is one race, with the first (respectively, last) accumulator to reach threshold associated with the best (respectively, worst) choice. The second variant, the sequential model, has two races that occur in sequence; the winner of the first race determines the best response, and, omitting the winner of the first race, the winner of the second race determines the worst response; for simplicity, we constrain each drift rate for the second race to be the inverse of the corresponding rate in the first race. The third variant, the enumerated model, assumes a single race between accumulators that represent each possible (distinct) pair of best and worst choices (e.g., 4 × 3 = 12 accumulators for a choice between four options). For a best–worst pair (p, q), p ≠ q, the drift rate is the ratio d(p)/d(q) of the drift rate for p versus the drift rate for q. Our later fits to data show that the estimated drift rate d for an option is, effectively, equal to the exponential of the estimated utility u for that option in the corresponding MNL model. These are not the only plausible accumulator-based best–worst models—for example, best–worst choice can be modeled by two parallel races driven by utilities and disutilities, respectively. In fact, the latter model turns out to be the superior model when we later fit both the choices made and the time to make them. However, we consider the other models first because they correspond to the frameworks adopted by the marginal and paired-conditional (“maxdiff”) random utility models most commonly used to analyze best–worst scaling data (Marley & Louviere, 2005).

3.1. General framework

In this section, we present equations for the predicted choice probabilities and response times (omitting fixed offset times for processes such as motor production). To set the notation, let S with |S| ≥ 2 denote the set of potentially available options, and let X ⊆ S be a finite subset of options available on a single choice occasion. Assume there is a common threshold b across all options in the available set X. For each z ∊ S and each pair p, q ∊ S with p ≠ q, there is a best drift rate d(z); independent random variables Uz and Up,q uniformly distributed on [0, A] for some 0 ≤ A ≤ b; and independent normal random variables Dz and Dp,q with mean 0 and standard deviation s. Recall also that we assume the worst drift rate for z equals 1/d(z). For best choices, the drift rate variable for option z is then given by trunc(Dz + d(z)), where the truncation is to positive values; for worst choices, the drift rate variable for option z is trunc(Dz + 1/d(z)), again truncated to positive values. Similarly, the drift rate variable for the best–worst pair (p, q) is trunc(Dp,q + d(p)/d(q)). For best choices, the probability density function (PDF) of finishing times for the accumulator for option z ∊ X at time t, denoted bz(t), is given by the following:

bz(t) = (d/dt) Pr[(b − Uz) / trunc(Dz + d(z)) ≤ t],

with cumulative distribution function (CDF)

Bz(t) = Pr[(b − Uz) / trunc(Dz + d(z)) ≤ t].

We denote the corresponding PDF and CDF for worst choice by wz(t) and Wz(t), which are given by replacing d(z) with 1/d(z) in the above formulae. For best–worst choice, the corresponding functions are bw(p,q)(t) and BW(p,q)(t), with d(p)/d(q) replacing d(z) in the above formulae. Expressions given in Brown and Heathcote (2008) can be used to derive easily computed forms for these PDFs and CDFs under the above assumptions, using drift rate distributions truncated to positive values as described in Heathcote and Love (2012). Then, given the assumption that the accumulators are independent (that is, each accumulator has independent samples of the start point and drift rate variability), it is simple to specify likelihoods conditional on response choices and response times. These likelihoods for each of the three best–worst LBA models are shown in the next three subsections.

When fitting data without response times, we made the simplifying assumptions that s = 1, b = 1, and A = 0, as these parameters are only constrained by latency data. Even in response time applications, fixing s in this way is common. Estimation of b can accommodate variations in response bias and overall response speed, but without response time data, speed is irrelevant and response bias effects are expressed in drift rate estimates. The relative sizes of A and b are important in accounting for differences in accuracy and decision speed that occur when response speed is emphasized at the expense of accuracy, or vice versa (Brown & Heathcote, 2008; Ratcliff & Rouder, 1998). Our setting here (A = 0) is consistent with extremely careful responding. We also tried a less extreme setting (b = 2, A = 1) with similar results.
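Under these simplifying assumptions, each accumulator's finishing time is simply b divided by its (positively truncated) drift rate, so the PDF and CDF above reduce to easily computed expressions. The following R sketch implements this A = 0 special case; it is provided for illustration and is not the authors' code:

# Finishing-time CDF and PDF for one accumulator when A = 0:
# T = b / V, with V a normal(d, s) truncated to positive values.
lba0_cdf <- function(t, d, b = 1, s = 1)
  (1 - pnorm((b / t - d) / s)) / pnorm(d / s)
lba0_pdf <- function(t, d, b = 1, s = 1)
  (b / (s * t^2)) * dnorm((b / t - d) / s) / pnorm(d / s)

# Sanity check: the density integrates to 1 (every accumulator eventually finishes)
integrate(lba0_pdf, lower = 0, upper = Inf, d = 0.6)$value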

3.2. Ranking model

The ranking model is arguably the simplest way to model best–worst choice with a race. For a choice among n options it assumes a race between n accumulators with drift rates d(z). The best option is associated with the first accumulator to reach threshold and the worst option with the last accumulator to reach threshold, as shown in the upper row of Fig. 2. To link the models to data requires an expression for the probability of choosing a particular best–worst pair of distinct options for each possible choice set, X, given model parameters. The PDF for a choice of option x as the best at time t, and option y as the worst at time r, where r > t, denoted bwX(x, t; y, r), is given by Equation 1 shown in Fig. 2, for x, y ∊ X, x ≠ y. Equation 1 calculates the product of the probabilities of the “best” accumulator finishing at time t, the “worst” accumulator finishing at time r > t, and all the other accumulators finishing at times between t and r. Since the data sets do not include response times, we calculate the marginal probability BWX(x, y) of the selection of this best–worst choice pair, by integrating over the unobserved response times (t and r)—see Equation 2 in Fig. 2.

Figure 2.

Illustrative example of the decision processes of the ranking, sequential, and enumerated versions of the best–worst race models and their associated formulae (upper, middle, and lower rows, respectively). See main text for full details.

The ranking model predicts that the best option is chosen before the worst option, which may not be true in data. The ranking model could also be implemented in a worst-to-best form, with the first finishing accumulator associated with the worst option and the last finishing accumulator with the best option, and with each drift rate the inverse of the corresponding drift rate in the best-to-worst version. These versions cannot be discriminated without response time data, which we return to later.
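For readers who find a computational statement helpful, the following R sketch evaluates Equations 1 and 2 by numerical integration under the A = 0 simplification, using hypothetical drift rates; the lba0_pdf and lba0_cdf helpers repeat the single-accumulator expressions from the earlier sketch so the block is self-contained:

lba0_pdf <- function(t, d, b = 1, s = 1) (b / (s * t^2)) * dnorm((b / t - d) / s) / pnorm(d / s)
lba0_cdf <- function(t, d, b = 1, s = 1) (1 - pnorm((b / t - d) / s)) / pnorm(d / s)

ranking_bw_prob <- function(x, y, d) {          # P(option x chosen best, option y chosen worst)
  others <- setdiff(seq_along(d), c(x, y))
  inner <- function(t)                          # Equation 1, integrated over r > t
    integrate(function(r) lba0_pdf(r, d[y]) *
                sapply(r, function(ri) prod(lba0_cdf(ri, d[others]) - lba0_cdf(t, d[others]))),
              lower = t, upper = Inf)$value
  integrate(function(t) lba0_pdf(t, d[x]) * sapply(t, inner), 0, Inf)$value  # Equation 2
}

d <- c(2, 1, 0.5, 0.25)                         # hypothetical drift rates for four options
ranking_bw_prob(x = 1, y = 4, d)                # P(option 1 best, option 4 worst)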

3.3. Sequential model

The sequential race model assumes that the best–worst decision process is broken into two separate races that occur consecutively (see middle row of Fig. 2). The best race occurs first and selects the best option. The worst race then selects the worst option. The sequential model could have 2 × n mean drift rate parameters—one for each option in the best race, and one for each option in the worst race. To reduce the number of free parameters, we assume that, for each z ∊ S, the drift rate is d(z) in the best race and 1/d(z) in the worst race. This ensures that desirable choices (high drift rates) are both likely to win the best race and unlikely to win the worst race. We also assume that the worst-choice race does not include the option already chosen as best so that the same option cannot be chosen as both best and worst.

The first accumulator to reach threshold is selected as the best option. The probability for a choice of the option x as the best, BX(x), is given in Equation 3. It is the probability that accumulator x finishes at time t, and all the other accumulators finish at times later than t, integrated over all times t > 0. The worst race is run with the accumulators in the set X − {x}. The first accumulator to reach threshold is selected as the worst option, but each drift rate is now the inverse of the corresponding drift rate in the first race. The probability of a choice of the option y as the worst, WX−{x}(y), is given in Equation 4. The joint probability for the sequential model of choosing x as the best and y as the worst, BWX(x, y), is simply the product of Equations 3 and 4.

Clearly, a corresponding model is easily developed where the worst race occurs first and selects the worst option, and the best race occurs second and selects the best option.
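The following R sketch evaluates Equations 3 and 4 by numerical integration under the same A = 0 simplification, again with hypothetical drift rates; the joint best–worst probability is the product of the two race probabilities:

lba0_pdf <- function(t, d, b = 1, s = 1) (b / (s * t^2)) * dnorm((b / t - d) / s) / pnorm(d / s)
lba0_cdf <- function(t, d, b = 1, s = 1) (1 - pnorm((b / t - d) / s)) / pnorm(d / s)

race_win_prob <- function(winner, d)            # P(accumulator `winner` finishes first)
  integrate(function(t) lba0_pdf(t, d[winner]) *
              sapply(t, function(ti) prod(1 - lba0_cdf(ti, d[-winner]))),
            lower = 0, upper = Inf)$value

d <- c(x = 2, y = 1, z = 0.5, w = 0.25)         # hypothetical drift rates
p_best_x  <- race_win_prob(1, d)                # Equation 3: x wins the best race
p_worst_w <- race_win_prob(3, 1 / d[-1])        # Equation 4: w wins the worst race among {y, z, w}
p_best_x * p_worst_w                            # joint probability of choosing x best and w worst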

3.4. Enumerated model

The enumerated model assumes a race between each possible best–worst pair of distinct options in the choice set. This is analogous to one interpretation of the paired conditional (also called “maxdiff”) random utility model (Marley & Louviere, 2005; and our Appendix A). We later present an interpretation of this choice model that leads to the parallel best–worst LBA. For a choice set with n options, the model assumes a race between n × (n − 1) accumulators. This model predicts a single decision time for both responses. The probability of choosing x as best and y ≠ x as worst for this model is shown in Equation 5.

For a choice set with n options, the enumerated model could have n × (n − 1) drift rate parameters. However, we simplified the enumerated model in a similar way to the sequential model, by again defining the desirability of options by their drift rates, and the undesirability of options by the inverse of their drift rates. In particular, we estimated a single drift rate parameter d(z) for each choice option z and set the drift rate for the accumulator corresponding to the option pair (pq) to the ratio d(p)/d(q).
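The construction of the pair-wise drift rates, and a numerical evaluation of Equation 5 under the A = 0 simplification, can be sketched as follows (again with hypothetical drift rates):

lba0_pdf <- function(t, d, b = 1, s = 1) (b / (s * t^2)) * dnorm((b / t - d) / s) / pnorm(d / s)
lba0_cdf <- function(t, d, b = 1, s = 1) (1 - pnorm((b / t - d) / s)) / pnorm(d / s)

d <- c(x = 2, y = 1, z = 0.5, w = 0.25)         # hypothetical drift rates for four options
pairs <- subset(expand.grid(p = names(d), q = names(d), stringsAsFactors = FALSE), p != q)
pair_rate <- d[pairs$p] / d[pairs$q]            # 4 x 3 = 12 pair accumulators, rates d(p)/d(q)

# Equation 5: probability that the (best = x, worst = w) accumulator finishes first
target <- which(pairs$p == "x" & pairs$q == "w")
integrate(function(t) lba0_pdf(t, pair_rate[target]) *
            sapply(t, function(ti) prod(1 - lba0_cdf(ti, pair_rate[-target]))),
          lower = 0, upper = Inf)$value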

3.5. Estimating model parameters from data

We fit the three race models to two best–worst choice data sets, one about patients' preferences for dermatology appointments (Coast et al., 2006) and the second about preferences for mobile phones (Marley & Pihlens, 2012). In both data sets, response times were unavailable (i.e., not recorded) and the data structure was “long but narrow”—large sample sizes with relatively few data points per participant—which is common in discrete choice applications. Coast et al.'s study investigated preferences for different aspects of dermatological secondary care services. Four key attributes were identified as relevant to patient experiences, one of which had four levels, with the remaining three having two levels each (see Table 1). We denote each attribute/level combination (henceforth, “attribute level”) with two digits, shown in parentheses in the right column of Table 1. The first digit refers to the attribute and the second digit to its level: for example, attribute level “32” refers to level 2 of attribute number 3. For all four attributes, the larger the second digit (i.e., the level of the attribute), the more favorable that level.

Table 1. The four attributes and their levels from Coast et al. (2006)
Note. The values in parentheses indicate the coding used in Fig. 3 below.

Waiting time (1): Three months (11); Two months (12); One month (13); This week (14)
Doctor expertise (2): The specialist has been treating skin complaints part-time for 1–2 years (21); The specialist is in a team led by an expert who has been treating skin complaints full-time for at least 5 years (22)
Convenience of appointment (3): Getting to the appointment will be difficult and time consuming (31); Getting to the appointment will be quick and easy (32)
Thoroughness of consultation (4): The consultation will not be as thorough as you would like (41); The consultation will be as thorough as you would like (42)

Participants were given a description of a dermatological appointment that included a single level from each attribute and were asked to indicate the best and the worst attribute level. For example, on one choice occasion, a participant might be told that an upcoming appointment is 2 months away (12); with a highly specialized doctor (22); at an inconvenient location (31); and not very thorough (41). The participant would then be asked to choose the best and worst thing about this appointment. The same 16 scenarios were presented to each participant in the study and were chosen using design methodology that enabled all main effects to be estimated. Below, we compare the parameters estimated from the race models to parameters of the MNL models for Coast et al.'s (2006) choice data reported by Flynn et al. (2008).

Marley and Pihlens (2012) examined preferences for various mobile phones among 465 Australian pre-paid mobile phone users in December 2007. The phones were described by nine attributes with a combined thirty-eight attribute levels, described in Table 2. The values in parentheses in Table 2 code attribute levels in a manner similar to Table 1, though not all attributes have a natural preference order. Each respondent completed the same 32 choice sets with four profiles per set. Each phone profile was made up of one level from each of the nine attributes. Participants were asked to provide a full rank order of each choice set by first selecting the best profile, then the worst profile from the remaining three, and finally the best profile from the remaining two. Here, we restrict our analyses to choices of the best and the worst profile in each choice set.

Table 2. The nine attributes and their combined thirty-eight levels from Marley and Pihlens (2012)
Note. The values in parentheses indicate the coding used in Figs. 4 and 5 below.

Phone style (s): Clam or flip phone (1); Candy bar or straight phone (2); Slider phone (3); Swivel phone (4); Touch screen phone (5); PDA phone with a HALF QWERTY keyboard (6); PDA phone with a FULL QWERTY keyboard (7); PDA phone with touch screen input (8)
Brand (b): A (1); B (2); C (3); D (4)
Price (p): $49.00 (1); $129.00 (2); $199.00 (3); $249.00 (4)
Camera (c): No camera (1); 2 megapixel camera (2); 3 megapixel camera (3); 5 megapixel camera (4)
Wireless connectivity (w): No bluetooth or WiFi connectivity (1); WiFi connectivity (2); Bluetooth connectivity (3); Bluetooth and WiFi connectivity (4)
Video capability (v): No video recording (1); Video recording, up to 15 min (2); Video recording, up to 1 h (3); Video recording, more than 1 h (4)
Internet capability (i): Internet access (1); No Internet access (2)
Music capability (m): No music capability (1); MP3 music player only (2); FM radio only (3); MP3 music player and FM radio (4)
Handset memory (r): 64 MB built-in memory (1); 512 MB built-in memory (2); 2 GB built-in memory (3); 4 GB built-in memory (4)

3.5.1. Parameter constraints

A natural scaling property of the LBA model is that the drift rate distributions for competing choices can be multiplied by an arbitrary positive scale factor without altering predicted choice probabilities4—although response time predictions will be affected. We employed a modified LBA model, with truncated normal drift rate distributions. In this version, multiplying the mean drift rate parameters by a common factor does not simply rescale the drift rate distributions. Rather, the distribution shape is altered because the amount of truncation changes depending on how far the mean of the distribution falls from zero. For this reason, the scaling property holds only approximately for the truncated-normal LBA, and the closeness of the approximation depends on the size of the drift rate parameters.

For the mobile phone data, there were 38 different attribute levels, and the approximation to the regular scaling property held well enough that we were able to constrain the product of the estimated drift rates across the attribute levels to one, for each attribute. This results in 29 free drift rate parameters and mirrors Marley and Pihlens's (2012) constraints on their MNL model parameters. The dermatology data involved more extreme choice probabilities, and so smaller drift rates for some attributes. Therefore, we were not able to exploit the usual scaling property, and we imposed no constraints: There were 10 attribute levels, and we estimated 10 free drift rates. We note that this freedom allows us, theoretically, to separately estimate mean utility parameters and the associated variance parameters, which may prove useful in future research (Flynn, Louviere, Peters, & Coast, 2010).
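The within-attribute constraint can be implemented by estimating k − 1 free log drift rates for a k-level attribute and fixing the final level so that the logs sum to zero, as in the following sketch with assumed values:

# Sketch of the within-attribute constraint used for the mobile phone fits:
# the product of drift rates across the levels of an attribute is fixed at 1,
# so a four-level attribute contributes three free (log) drift rates.
free_log_d <- c(0.8, 0.2, -0.3)                 # levels 1-3 (assumed values)
log_d <- c(free_log_d, -sum(free_log_d))        # level 4 set so the logs sum to 0
d <- exp(log_d)
prod(d)                                         # equals 1 by construction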

3.5.2. Model fit

We aggregated the data across participants; thus, for each choice set in the design, we fit a single set of choice probabilities. We used two methods to evaluate the fit of the race models to data. The first compared drift rate estimates to the corresponding random utility model regression coefficients reported by Flynn et al. (2008) and Marley and Pihlens (2012). In this comparison, we use the logarithm of the drift rates, bringing them onto the same unbounded domain as the utility parameters. Second, we examined the race models' goodness-of-fit by comparing observed and predicted best–worst choice proportions. For each choice set, observed best–worst choice proportions were calculated by dividing the number of times a particular best–worst pair was selected across participants by the number of times that choice set was presented across participants.

4. Model fits

4.1. Coast et al.'s (2006) dermatology data

Flynn et al. (2008) analyzed Coast et al.'s (2006) data using a paired model conditional logit regression, adjusted for covariates.5 Flynn et al.'s regression coefficients were expressed as treatment-coded linear model terms (main effects for attributes, plus treatment effects for each level). For example, Flynn et al. found that the main effect for the “convenience” attribute was 0.715 with a treatment effect of 1.501 for the “very convenient” level (Table 4 in Flynn et al.). For ease of comparison, we expressed the estimated drift rate parameters from the LBA models in this same coding. We referenced all parameters against the zero point defined by the waiting time attribute, by subtracting the mean drift rate for this attribute from all drift rates. We then calculated the main effect for each attribute as the mean drift rate for that attribute and calculated treatment effects for each attribute level by subtracting the main effects. These calculations are independent of the parameter estimation procedure and serve only to facilitate comparison with Flynn et al.'s results.
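The following R sketch illustrates this re-coding on hypothetical log drift rates for the dermatology attribute levels; the numerical values are invented purely to show the computation:

# Re-coding of (log) drift rates into attribute main effects and level-specific
# treatment effects, referenced against the waiting-time attribute.
log_d <- c("11" = -1.2, "12" = -0.4, "13" = 0.3, "14" = 1.1,   # waiting time (assumed values)
           "21" = -0.6, "22" = 0.8,                             # expertise
           "31" = -0.9, "32" = 1.0,                             # convenience
           "41" = -1.1, "42" = 1.2)                             # thoroughness
attribute <- substr(names(log_d), 1, 1)
log_d <- log_d - mean(log_d[attribute == "1"])        # zero point: mean of waiting time
main_effects <- tapply(log_d, attribute, mean)        # one main effect per attribute
treatment_effects <- log_d - main_effects[attribute]  # deviations from the main effect
round(main_effects, 2)
round(treatment_effects, 2)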

The upper row of Fig. 3 compares the log drift rates estimated from the three race models against the corresponding parameters from Flynn et al.'s (2008) fit of the maxdiff (MNL) model; Appendix A gives the form of that model. The four main effect estimates are shown as boldfaced single digits, and treatment effects as regular faced double digits, using the notation from Table 1. For all three model variants, there was an almost perfect linear relationship between log drift rates estimated for the race model and the parameters for the corresponding MNL model.

Figure 3.

Log drift rate parameter estimates (upper row), plotted against Flynn et al.'s (2008) utility estimates, and goodness-of-fit (lower row) of the ranking, sequential, and enumerated race models (left, middle, and right columns, respectively) to Coast et al.'s (2006) data. In the upper row, boldface single digits represent main effects, and double digits represent attribute levels, using the notation from Table 1. In the lower row, the x-axes display best–worst choice proportions from data, and the y-axes display predicted best–worst choice probabilities from the estimated race model. The diagonal lines show a perfect fit.

All of the race models provided an excellent fit to the dermatology data, as shown in the lower row of Fig. 3. In those plots, a perfect fit would have all the points falling along the diagonal line. For all models, there was close agreement between observed and predicted values, with all R2's above .9. The root-mean-squared difference between observed and predicted response probabilities was 5.4%, 5.1%, and 5.3% for the ranking, sequential, and enumerated models, respectively. The corresponding log-likelihood values were −1,379, −1,406, and −1,381, respectively, providing little basis to select between LBA models in this analysis. Flynn et al.'s (2008) marginal model conditional logit analysis, which has the same number of free parameters as our LBA models, produced a log-pseudolikelihood of −1,944, suggesting that the LBA provides a better fit to these data.

4.2. Marley and Pihlens's (2012) mobile phone data

Marley and Pihlens (2012) analyzed their full rank data using a repeated maxdiff (MNL) model;6 Appendix A gives the form of that maxdiff model for the first (best) and last (worst) options in those rank orders. We compare the log drift rate parameter estimates from our race models for those first (best) and last (worst) choices against Marley and Pihlens's utility parameter estimates. The upper row of Fig. 4 plots the estimated log drift rates from the race models against the regression coefficients from Marley and Pihlens's maxdiff model. There was again a nearly perfect linear relationship between log drift rates estimated for the race model and regression coefficients.

Figure 4.

Log drift rate parameter estimates (upper row), plotted against Marley and Pihlens's (2012) regression coefficients, and goodness-of-fit (lower row) of the ranking, sequential, and enumerated race models (left, middle, and right columns, respectively) to Marley and Pihlens's mobile phone data. In the upper row, each point represents an attribute level, where the letter indicates the attribute and the number indicates the level of the attribute, as in Table 2. In the lower row, the x-axes display best–worst choice proportions from data, the y-axes display predicted best–worst choice probabilities from the estimated race model, and the diagonal lines represent a perfect fit.

To further demonstrate the strength of the linear relationship between drift rates and utility estimates, we represent the parameter values for the sequential model shown in Fig. 4 separately for each attribute, in Fig. 5. The separate plots clearly illustrate the almost perfect correspondence between the rank ordering on drift rates and the rank ordering on regression coefficients. The drift rates preserve not only the ordering but also the differences in magnitude between levels of each attribute. For example, the price attribute, shown in the upper right panel of Fig. 5, demonstrates that people have the strongest preference for the cheapest phones ($49) and the weakest preference for the most expensive ones ($249). However, the difference in utility (regression coefficients) is much greater between some adjacent levels than others (e.g., moving from the third to the fourth level, $199 to $249). Such a difference in magnitude also occurred in, for instance, the camera attribute, where a phone with no camera (level 1) was much less desirable than any phone with a camera (levels 2, 3, and 4). In all cases, the estimated drift rates were sensitive to such differences in magnitude as well as the rank ordering. Sensitivity to these important outcome measures (ranking and magnitude) suggests the race models may be useful for measurement purposes.

Figure 5.

Fit of the sequential race model to Marley and Pihlens' (2012) mobile phone data shown separately for each attribute. Log drift rates are plotted against Marley and Pihlens' regression coefficients. Each panel represents a different attribute. The numbers inside the nine panels represent each attribute level. The black lines in each panel represent the regression line fit to the sequential race model log drift rates and Marley and Pihlens' regression coefficients shown in the middle panel of the upper row in Fig. 4. The dashed horizontal and vertical lines represent zero-reference points.

We assessed the goodness-of-fit of each model by comparing observed and predicted proportions, shown in the lower half of Fig. 4. As with the dermatology data, there was excellent agreement, with 3.2% root-mean-squared prediction error for all three models. Although the goodness-of-fit appears poorer in Fig. 4 compared to Fig. 3, this is actually due to differences in the scale of the axes across figures. The R2's were smaller than for the dermatology data (though all are > .57), reflecting greater interindividual variability in choices. The best–worst LBA models reported here have the same number of free parameters as the MNL models reported by Marley and Pihlens (2012), so following Marley and Pihlens we compared goodness-of-fit between models with McFadden's ρ2 measure. McFadden's ρ2 measures the fit of a full model with respect to the null model (all parameters equal to 1), and is defined as ρ2 = 1 − LLfull/LLnull, where LLfull and LLnull refer to the estimated log-likelihoods of the full and null models, respectively. This showed almost identical results for the race models (ranking ρ2 = .245, sequential ρ2 = .246, and enumerated ρ2 = .246) as the MNL models (best, then worst, ρ2 = .244, and maxdiff ρ2 = .245). This suggests that the race models fit the data as well as the MNL models.
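For reference, the computation is trivial; the log-likelihood values in the sketch below are invented for illustration:

# McFadden's rho-squared from (hypothetical) log-likelihood values.
mcfadden_rho2 <- function(ll_full, ll_null) 1 - ll_full / ll_null
mcfadden_rho2(ll_full = -10500, ll_null = -13900)   # approximately .24 with these made-up values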

As a final comparison with existing MNL models for best–worst data, we compared the race models' drift rate parameters against “best minus worst scores” calculated from the data.7 As with similar analyses in the literature (Finn & Louviere, 1992; Goodman, 2009; Mueller Loose & Lockshin, 2013), for both data sets the agreement between the best minus worst scores and the drift rates was just as strong as the relationship between drift rates and regression coefficients, reinforcing the current consensus that the best minus worst scores are a simple, but useful, way to describe data. Theoretical properties of these scores for the maxdiff model of best–worst choice are stated and proved in Flynn and Marley (2013), Marley and Islam (2012), and Marley and Pihlens (2012).
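For completeness, best minus worst scores are simple tallies; the following sketch uses hypothetical counts:

# Best-minus-worst scores: for each option (or attribute level), the number of
# times it was chosen best minus the number of times it was chosen worst,
# scaled by how often it was available. All counts below are hypothetical.
best_counts  <- c(opt1 = 320, opt2 = 150, opt3 = 80,  opt4 = 30)
worst_counts <- c(opt1 = 20,  opt2 = 90,  opt3 = 170, opt4 = 300)
available    <- c(opt1 = 600, opt2 = 600, opt3 = 600, opt4 = 600)
(best_counts - worst_counts) / available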

4.3. Discussion

The three MNL-inspired LBA model variants we proposed are all capable of fitting the dermatology and mobile phone data sets at least as well as the standard MNL choice models. Although each of the models makes different assumptions about the cognitive processes underlying best–worst choices, all fit the choice data equally well, making them difficult to distinguish on the basis of choices alone. Response time data have the potential to tease the models apart—for example, the ranking LBA model makes the strong prediction that “best” responses will always be faster than “worst” responses. Testing such predictions against data can better inform investigations into the cognitive processes underlying preferences, paralleling similar developments in the understanding of single-attribute perceptual decisions (e.g., see Ratcliff & Smith, 2004) and best-only decisions about multi-attribute stimuli (Otter et al., 2008; Ruan et al., 2008). This illustrates the potential benefits that arise from using cognitive process-based models (such as accumulator models) for both choice and response time.

In the next section, we demonstrate that a best–worst scaling task that incorporates response time measurement can aid discrimination between the LBA variants we have proposed. We show that the three LBA variants introduced above, derived from analogous MNL models, are inconsistent with the response time data from a perceptual judgment task. We propose a modification to the sequential LBA model that naturally accounts for the response latency data, demonstrating that response times provide added benefit to best–worst scaling.

5. Response times in best–worst scaling

The ranking, sequential, and enumerated race models make unique predictions about the pattern of predicted response times. For example, the ranking and sequential models predict that best responses always occur before worst responses. Alternatively, the two models could be instantiated in a worst-to-best fashion, in which case they predict worst responses always occur before best responses. If the data exhibit a mixture of response ordering—sometimes the “best” before “worst,” and vice versa—then we have evidence against the assumptions of these two models, at least for their strict interpretation. We discuss below implications of response order patterns in data, and the possible inclusion of a mixture process that permits variability in the order of the races (i.e., the ranking model might sometimes occur in a worst-to-best manner and sometimes in a best-to-worst manner, or the sequential model might occur in a worst-then-best order and sometimes in a best-then-worst order). The enumerated model also makes strong predictions about response times: Best and worst choices should differ only by an offset time due to motor processes, since the single enumerated race provides both the best and worst responses. Consequently, the enumerated model predicts that experimental manipulations, such as choice difficulty or choice set size, should not influence the time interval between the best and worst responses.

As a first target for investigating models of response times in best–worst scaling experiments, we used a simple perceptual judgment task rather than a traditional consumer choice task. This enabled us to collect a large number of trials per participant, providing data that supported a finer grained analysis of response time distributions and analysis of individual participants' data. In addition, perceptual tasks permit precise stimulus control that allows testing of, for example, enumerated model predictions such as the absence of choice difficulty effects on interresponse times. We leave to future research the investigation of response time data from more typical multi-attribute discrete choice applications, such as the dermatology and mobile phone examples examined in the first section.

6. Experiment

We used a modified version of Trueblood, Brown, Heathcote, and Busemeyer's (2013) area judgment task. On each trial, participants were presented with four rectangles of different sizes and were asked to select the rectangle with the largest area (an analog to a “best” choice) and the rectangle with the smallest area (an analog to a “worst” choice).

6.1. Participants

Twenty-six first-year psychology students from the University of Newcastle participated in the experiment online in exchange for course credit.

6.2. Materials and methods

The perceptual stimuli were adapted from Trueblood et al. (2013), in which participants were asked to judge the area of black shaded rectangles presented on a computer display. We factorially crossed three widths with three heights to generate nine unique rectangles, with widths, heights, and areas given in Table 3. The rectangles at the extreme ends of the stimulus set were easy to discriminate by area (e.g., 6,050 vs. 8,911 pixels), but those in the middle of the stimulus set were much more difficult (e.g., 8,107 vs. 8,113 pixels).

Table 3. The nine rectangular stimuli generated by factorially crossing three rectangle widths with three rectangle heights
Note. All measurements are in pixels.

Width   Height   Area
55      110      6,050
55      121      6,655
55      133      7,315
61      110      6,710
61      121      7,381
61      133      8,113
67      110      7,370
67      121      8,107
67      133      8,911

On each trial, four rectangles were randomly sampled, without replacement, from the set of nine rectangles. The stimuli were presented in a horizontal row in the center of the screen, as shown in Fig. 6. All rectangles were subject to a random vertical offset between ± 25 pixels, to prevent the use of alignment cues to judge height.

Figure 6.

Illustrative example of a trial in the area judgment task. Note that if a participant selected, say, “Largest 1” as the rectangle with the largest area, then the option “Smallest 1” was made unavailable for selection.

Each participant chose the rectangle judged to have the largest area and a different rectangle judged to have the smallest area. All responses were recorded with a mouse click and could be provided in either order: largest-then-smallest or smallest-then-largest. We prevented participants from selecting the same rectangle as both the largest and smallest option by removing the option selected first as a possibility for the second response. On each trial, we recorded the rectangle chosen as largest and the time taken to make that choice, and the rectangle chosen as smallest and the time taken to make that choice. Each participant completed 600 trials across three blocks.

7. Results

We excluded trials with outlying responses that were unusually fast (less than .5 s) or slow (more than 25 s). We also excluded two participants who had more than 10% of their trials marked as outliers. Of the remaining participants' data, outliers represented only 0.9% of total trials.

We first report the proportion of correct classifications—correct selection of the largest (respectively, smallest) rectangle in the stimulus display. We follow this analysis by considering the effect of response order—whether participants responded in a largest-then-smallest, or smallest-then-largest, manner—on both choice proportion and response latency data.

7.1. Correct classifications

Our first step in analysis was to determine whether the area judgment manipulation had a reliable effect on performance. The left and middle panels of Fig. 7 display the proportion of correct responses for largest (respectively, smallest) judgments as a function of choice difficulty. We operationalized difficulty as the difference in area between the largest (respectively, smallest) and second largest (respectively, second smallest) rectangle presented at each trial, which we refer to as the max-versus-next (respectively, min-vs.-next) difference. A small max-versus-next (respectively min-vs.-next) difference in area makes it difficult to resolve which is the largest (respectively, smallest) rectangle.

Figure 7.

The left panel shows the proportion of times that the rectangle chosen as best was the largest rectangle (i.e., correct choice) as a function of the difference in area between the largest and second largest rectangles in the display (in pixels). The middle panel shows the proportion of times that the rectangle chosen as worst was the smallest rectangle as a function of the difference in area between the smallest and second smallest rectangles in the display. Error bars represent one standard error of the mean. The overlaid lines represent the best-fitting cumulative normal psychophysical functions, fit separately to the data in each panel, with the legend shown in the right panel.

Performance was well above chance even for the smallest max-versus-next and min-versus-next differences (6 and 11 pixels, respectively), yielding 45% and 50% correct selections of the largest and smallest rectangles in the display, respectively (chance performance is 25%). As expected, when the max-versus-next and min-versus-next difference increased, so did the proportion of correct responses. We overlaid the best-fitting cumulative normal psychophysical functions separately on the max-versus-next and min-versus-next difference scores. The good fit is consistent with the notion that participants' decisions were sensitive to a noisy internal representation of area.
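A cumulative normal psychophysical function of this kind can be fit with a probit regression, as in the following simplified R sketch; the trial-level data are fabricated for illustration, and the sketch ignores the 25% guessing floor that the functions in Fig. 7 must respect:

# Probit regression of accuracy on the max-versus-next area difference
# (fabricated data; a probit link corresponds to a cumulative normal function).
set.seed(5)
diff_area <- sample(c(6, 100, 250, 500, 1000, 1500, 2000), 2000, replace = TRUE)
correct   <- rbinom(2000, 1, pnorm(-0.5 + 0.002 * diff_area))         # fake accuracy data
fit <- glm(correct ~ diff_area, family = binomial(link = "probit"))
predict(fit, data.frame(diff_area = c(6, 500, 2000)), type = "response")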

7.2. Response order effects

To connect the following model and data with earlier material, we refer to best (respectively, worst) rather than largest (respectively, smallest). There was considerable variability across participants in the proportion of best-before-worst versus worst-before-best responses (Fig. 8), ranging from almost completely worst-first to almost completely best-first. Over participants, the majority of first responses were for the best option (M = .73), which was significantly different from chance (μ = .5) according to a one-sample t-test, t(23) = 3.18, p = .004. These patterns suggest that models which strictly impose a single response ordering, such as the ranking and sequential models, require modification.

Figure 8.

Proportion of trials in which the best response was made before the worst response, shown separately for each participant. Circular symbols show data, and crosses represent predictions of the parallel race model. Error bars represent the standard error of a binomial proportion, √(p̂(1 − p̂)/n), where p̂ is the proportion of best-first choices in data and n is the number of trials. The dashed horizontal line represents the mean proportion of best-first responses across participants.

We next examined whether choice difficulty influenced response order. We defined the difficulty of best and worst choices, respectively, using the max-versus-next and min-versus-next criteria described above, but we collapsed these into three exhaustive difficulty categories: hard (less than 250 pixels), medium (500–1,000 pixels), and easy (greater than 1,250 pixels; no difference scores fell between 250 and 500 or between 1,000 and 1,250 pixels). The mean proportion of best-first responses reliably decreased as the discrimination of the largest rectangle became more difficult, F(1.4, 31.9) = 4.07, p = .04, using Greenhouse-Geisser adjusted degrees of freedom (all subsequent ANOVAs also report Greenhouse-Geisser adjusted degrees of freedom). This effect suggests that when there is an easy-to-see largest rectangle, that response was likely to be made first: easy M = .739 (within-subjects SE = .006), medium M = .726 (.003), and hard M = .719 (.006). The analogous result was observed for worst-first choices, but larger in effect size—an easy-to-see smallest rectangle made a worst-first response more likely, F(1.6, 35.7) = 10.03, p < .001; easy M = .289 (.006), medium M = .275 (.004), and hard M = .256 (.006).

We also examined the effect of choice difficulty on the time interval between the first and second responses—the interresponse time. For each participant, we calculated the average interresponse time for best-first trials as a function of the difficulty of the worst (second) response (easy, medium, difficult), and for worst-first trials as a function of the difficulty of the best (second) response. As the second judgment became more difficult, the latency between the first and second responses increased, approximately half a second across the three difficulty levels, F(1.2, 24.8) = 10.81, p = .0028; easy M = 1.51 s (.12), medium M = 1.75 s (.08), and hard M = 1.95 s (.109).

7.3. Discussion

Data from the best–worst perceptual choice experiment exhibited three effects relevant to testing the models: large differences between participants in the preference for best-then-worst versus worst-then-best responding; changes in the proportion of best-first and worst-first responding as a function of choice difficulty; and changes in interresponse times due to choice difficulty. These effects are inconsistent with all three MNL-derived race models in their original forms. First, the three models cannot accommodate within-participant, across-trial variability in best- or worst-first responding. They might be able to account for the response order effects through the addition of a mixture process. On a certain proportion of trials, defined by a new parameter, the ranking race or the sequential races could be run in reverse order, or the enumerated model could execute its responses in opposite orders (Marley & Louviere, 2005, considered such mixture models for best-then-worst and worst-then-best choice).

The mixture approach adds a layer of complexity to the model—an extra component outside the choice process itself—which is unsatisfying. Putting that objection aside, these changes would not be able to account for the effect of choice difficulty on the proportion of best- and worst-first responses, or on the interresponse times, because the mixture process is independent of the choice process. For these reasons, we do not explore the fit to response time data of the ranking, sequential, or enumerated models when augmented with a mixture process. Instead we propose a modification to the sequential model, preserving the idea that there are separate best-choice and worst-choice races, but assuming that they occur in parallel rather than sequentially. This best–worst race model is an extension of Marley and Louviere's (2005, Sect. 4.1.2, Case 2) process model for best–worst choice (which, in its general form, does not make MNL assumptions) to both choices and response times.

8. The parallel model

Where the sequential model assumes consecutive best, then worst, races, the parallel model assumes concurrent best and worst races. The best option is associated with the first accumulator to reach threshold in the best race, and the worst option is associated with the first accumulator to reach threshold in the worst race. We present, and test, the simplest version of this model, which allows the same option to be selected as both best and worst (an outcome that was not possible in our experiment). The model also allows for vanishingly small interresponse times, which are not physically possible. These properties affect a sufficiently small proportion of decisions for our rectangle data that we neglect them here, for mathematical convenience.

The probability of a choice of the option x as best at time t and option y as worst at time r, where no constraint exists between t and r, is the product of the individual likelihoods of the best and worst races,

bwX(x, t; y, r) = [bx(t) ∏z∊X−{x} (1 − Bz(t))] × [wy(r) ∏z∊X−{y} (1 − Wz(r))].

To calculate the marginal (best–worst) probability BWX(x, y) from the parallel race model in the absence of response time data, the individual likelihoods of the best and worst races are integrated over all times t > 0 and r > 0, respectively. When fit in this manner to choices only, the parallel model provides an account of the dermatology and mobile phone data sets equal in quality to the three LBA models described in the first section (dermatology—correspondence between MNL model regression coefficients and log estimated LBA drift rates, R2 = .97, close agreement between observed and predicted choice proportions, R2 = .92, and 5% root-mean-squared prediction error; mobile phones—R2 = .99, R2 = .60, and 3.7%, respectively).

The parallel model overcomes the drawbacks of the three previous models by accounting for all general choice and response time trends observed in data. For instance, the model is able to capture inter- and intra-individual differences in response style—those participants who prefer to respond first with the best option, or first with the worst option—by allowing separate threshold parameters for the best and worst races. For example, a participant who primarily responds first with the best option is assumed to set a lower response threshold in the best race than in the worst race. This means that, on average, an accumulator in the best race reaches threshold prior to an accumulator in the worst race.

The parallel race model also accounts for the effect of choice difficulty on best- and worst-first responses, via differences in drift rates across rectangles. Very easy discrimination of the largest rectangle tends to occur when there is a large area for one rectangle, with a correspondingly large drift rate and so a fast response and a largest-before-smallest response order. Similarly, the parallel model predicts that difficult judgments rise to threshold more slowly than easy judgments. Therefore, irrespective of whether the best or worst race finishes first, the slower of the two races will still exhibit an effect of choice difficulty on latency. Since the best and worst races are (formally) independent, by extension the difficulty of the slower (second) judgment will also affect the interresponse time.

8.1. Estimating parallel model parameters from perceptual data

We fit the parallel model to individual participant data from the best–worst area judgment task. Our methods were similar to those used previously with the multi-attribute data. We estimated nine drift rate parameters, one for each rectangle stimulus. We fit the model twice, once where we ignored response time data (as before) and once where we used those data. When ignoring response times, we arbitrarily fixed A = 0, b = 1, s = 1 and t0 = 0 for the best and worst races. When fitting the model to response times, we estimated a single value of the start-point range, A, and non-decision time, t0, parameters, with separate response thresholds for the best and worst races, bbest and bworst (we again fixed s = 1, which serves the purpose of fixing a scale for the evidence accumulation processes). Therefore, when the parallel model was fit to response time data, it required four additional free parameters compared to fits to choice-only data. Regardless of the data type—choice-only, or choices and response times—the approach to parameter optimization was the same: For each participant and trial, we calculated the log-likelihood given model parameters, summed across trials, and maximized. We provide code to fit the parallel model to the choices and response times of a single participant in the freely available R language (R Development Core Team, 2012), as online supplementary material to this paper and in the “publications” section of the authors' website at http://www.newcl.org/.
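The following compressed sketch conveys the structure of such a fit; it is not the supplementary code itself, and the parameter transformations, numerical guards, and the simplification of racing all nine accumulators on every trial (rather than only the four presented rectangles) are our own choices for brevity:

# Per-trial likelihood for the parallel model and a maximum-likelihood fit.
# The single-accumulator density and distribution follow Brown and Heathcote (2008),
# divided by pnorm(v / s) for the positive truncation of Heathcote and Love (2012).
lba_pdf <- function(t, v, b, A, s = 1) {
  z1 <- (b - A - t * v) / (t * s); z2 <- (b - t * v) / (t * s)
  ((-v * pnorm(z1) + s * dnorm(z1) + v * pnorm(z2) - s * dnorm(z2)) / A) / pnorm(v / s)
}
lba_cdf <- function(t, v, b, A, s = 1) {
  z1 <- (b - A - t * v) / (t * s); z2 <- (b - t * v) / (t * s)
  (1 + (b - A - t * v) / A * pnorm(z1) - (b - t * v) / A * pnorm(z2) +
     (t * s / A) * (dnorm(z1) - dnorm(z2))) / pnorm(v / s)
}

# Joint density of one trial: option `best` chosen at t_best in the best race
# (drift rates d) and option `worst` chosen at t_worst in the worst race (rates 1/d).
# For brevity this sketch races all accumulators on every trial.
trial_density <- function(best, worst, t_best, t_worst, d, b_best, b_worst, A, t0) {
  tb <- t_best - t0; tw <- t_worst - t0
  if (tb <= 0 || tw <= 0) return(1e-10)
  like_best  <- lba_pdf(tb, d[best], b_best, A) * prod(1 - lba_cdf(tb, d[-best], b_best, A))
  like_worst <- lba_pdf(tw, 1 / d[worst], b_worst, A) * prod(1 - lba_cdf(tw, 1 / d[-worst], b_worst, A))
  max(like_best * like_worst, 1e-10)
}

# Summed negative log-likelihood for a data frame `trials` with columns
# best, worst (indices of the chosen options) and t_best, t_worst (in seconds).
neg_log_lik <- function(par, trials, n_opt = 9) {
  d  <- exp(par[1:n_opt])                       # drift rates, kept positive
  A  <- exp(par[n_opt + 1]); t0 <- exp(par[n_opt + 2])
  b_best  <- A + exp(par[n_opt + 3])            # thresholds constrained above A
  b_worst <- A + exp(par[n_opt + 4])
  -sum(log(mapply(trial_density, trials$best, trials$worst, trials$t_best, trials$t_worst,
                  MoreArgs = list(d = d, b_best = b_best, b_worst = b_worst, A = A, t0 = t0))))
}

# fit <- optim(par = rep(0, 13), fn = neg_log_lik, trials = one_participant_data)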

9. Model fits

Although we fit the model to data from individual participants, for ease of exposition we primarily report the fit of the models at the aggregate level (i.e., results summed over participants). Unlike the previous fits, where we compared drift rate estimates to the corresponding random utility model regression coefficients, here we compare drift rate estimates to the area of the rectangles to demonstrate that the model recovers sensible parameter values. As above, we first assess goodness-of-fit by comparing observed and predicted choice proportions. For each participant, we calculated the number of times that each rectangle area was chosen as best (respectively, worst), and then normalized by the number of trials on which each rectangle area was presented. For the fits to response time data, we also examine goodness-of-fit by assessing the observed and predicted distribution of best and worst response times at the aggregated level. To demonstrate that the model captures individual differences, we also present examples of model fits to individual participant response time distributions. Finally, we compared predictions of the model to the response order data (choice proportions and response times).
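As an illustration of this calculation (our notation, not the original analysis code): assuming a data frame 'trials' with one row per presented rectangle, a column 'area' giving the rectangle area, and logical columns 'chosen_best' and 'chosen_worst', the observed proportions can be obtained as

best_prop  <- with(trials, tapply(chosen_best,  area, mean))   # best choices per presentation of each area
worst_prop <- with(trials, tapply(chosen_worst, area, mean))   # worst choices per presentation of each area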

9.1. Choice proportions

The upper panels of Fig. 9 plot mean estimated drift rate from the parallel model against rectangle area. Whether based on fits to choices only or to choices and response times, the estimates followed a plausible pattern—mean drift rate increased as a sigmoidal function of rectangle area, which is the standard pattern in psychophysical judgments (e.g., Ratcliff & Rouder, 1998). There was a very strong effect of rectangle area on mean estimated log drift rate, both for the choice-only fits, F(2.1, 49.3) = 161, p < .001, and for the response time fits, F(2.1, 48.8) = 143, p < .001.

Figure 9.

Estimated drift rates and goodness-of-fit to data for the parallel race model when fit to choices only, and choices and response times (left and right columns, respectively). The upper panels show mean estimated log drift rates as a function of rectangle area. Error bars indicate within-subjects standard errors of the mean. The middle and lower panels show the goodness-of-fit to the experimental data for best and worst responses, respectively. The x-axes display choice proportions from data, the y-axes display predicted choice probabilities from the parallel race model, and the diagonal lines show a perfect fit. In the lower panels, each participant contributed nine data points to each panel—one for each rectangle area.

The middle and lower panels of Fig. 9 show the goodness-of-fit of the parallel model to choice data, separately for both methods of fitting the model. Choice-only fits provided an excellent account of the best and worst choice proportions—both R²'s > .98, with root-mean-squared differences between observed and predicted choice proportions of 1.7% and 4%, respectively. When the model was also required to accommodate response times, it still provided a good fit to the choice proportion data: R²'s > .95 and root-mean-squared prediction errors of 7.1% and 7.5% for best and worst proportions, respectively. The slight reduction in goodness-of-fit is expected, since the latter fits must account for an additional aspect of the data (response times)—predictions of response times and response choices are not independent, and the data contain measurement noise.

9.2. Response times

When fit to response times, the parallel model provides a good account of best and worst response time distributions. We first consider group-level data where we used quantile averaging to analyze aggregate response time distributions. Quantile averaging conserves distribution shape, under the assumption that individual participant distributions differ only by a linear transformation (Gilchrist, 2000; Fig. 11 suggests that this was generally the case in our data). For each participant and separately for best and worst responses, we calculated the 1st, 5th, 10th, 15th, …, 90th, 95th, 99th percentiles of response time distributions, and then averaged the individual participant percentiles to form an aggregate distribution of response times. Finally, we converted the averaged percentiles to histogram-like distributions, shown in the upper panel of Fig. 10. The averaged data demonstrate the stereotypical properties of latency distributions: sharp onset of the leading edge closely followed by a single peak and a slow decline to a long, positively skewed tail. In the aggregate distributions, best responses were faster than worst responses, as expected from the proportion of best-first responses across participants (Fig. 8). Importantly, the parallel race model provides a good account of the best and worst response time distributions, capturing each of the characteristic distribution trends just described.
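A minimal sketch of the quantile-averaging step (our illustration; rt_list is assumed to be a list of response time vectors, one per participant):

probs <- c(.01, seq(.05, .95, by = .05), .99)       # percentiles listed in the text
quantile_average <- function(rt_list, probs) {
  q <- sapply(rt_list, quantile, probs = probs)     # one column of percentiles per participant
  rowMeans(q)                                       # average each percentile across participants
}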

Figure 10.

Response time distributions for experimental data and predictions from the parallel race model. The upper panel shows quantile-averaged data as stepped histograms, with best and worst responses shown in black and gray, respectively. The smooth density curves show model predictions, averaged in the same way as the data, with best and worst predictions shown in green and red, respectively. The lower panels display model fits to a selection of individual participant data. Participants are broadly classified into three categories of responders: those who tended to respond with the best option first, those who tended to respond with the worst option first, and those who showed no strong preference for either order. For model fits to all 24 participants, see Fig. 11 in Appendix B.

We next consider the fit of the parallel model to individual participant latency data. The lower half of Fig. 10 shows data from nine individual participants and demonstrates that the parallel race model provides a very good fit to individuals, whether they prefer best-first responding, worst-first responding, or a mixture (see Fig. 11 in Appendix B for model fits to the response time distributions of all 24 participants).

Figure 11.

Response time distributions for experimental data and predictions from the parallel race model. Each panel shows a separate participant. Data are shown as stepped histograms with best responses in black and worst responses in gray. Model predictions are shown as smooth density curves with best predictions in green and worst in red.

9.3. Response order effects

Figure 8 shows the proportion of best-first responses for each participant. Overlaid on those data are model predictions of the expected proportion of best-first responses for each participant. The model captures the qualitative trends in the pattern of best-first preference data across participants, with a smooth shift from predominantly worst-first participants through to predominantly best-first participants. However, the model does not capture the strength with which some participants prefer a best- or worst-first pattern of responding.

The parallel model can also capture the effect of the difficulty of the area judgment on the proportion of best- and worst-first responses and on the mean latency between the first and second responses. Separately for each participant, we calculated the proportion of best- and worst-first responses as a function of choice difficulty, in both the data and the model predictions, and then compared data and model after aggregating these proportions across participants. The model provided a good account of the response proportion data across difficulty levels, with R²'s of .87 and .90 for best- and worst-first responses, respectively. Similarly, the latency between the first and second responses was influenced by the difficulty of the area judgment of the second choice—as the difficulty of the second judgment increased, so too did the interresponse latency. The parallel model predicts this qualitative trend: increased interresponse time with increased difficulty, with a predicted increase of approximately .4 seconds for each step in difficulty level.
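The aggregation just described amounts to the following two-step summary (our notation; 'resp' is assumed to hold one row per trial with columns participant, difficulty, and a logical best_first):

by_participant <- aggregate(best_first ~ participant + difficulty, data = resp, FUN = mean)  # within-participant proportions
by_difficulty  <- aggregate(best_first ~ difficulty, data = by_participant, FUN = mean)      # averaged across participants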

10. General discussion

Accurately eliciting preferences is important in a wide variety of applied fields, from public policy to marketing, as exemplified by the dermatology and mobile phone data that we examined. Discrete choice, and in particular best–worst scaling, provides more robust and efficient measures of preference than alternative approaches, particularly when combined with random utility analyses. One drawback of random utility models is that they do not provide natural accounts of the cognitions underlying decision making or the times taken to make those decisions. On the other hand, accumulator (or race) models, which involve both the choices made and the time to make them, have proven useful in illuminating the cognitive and neurophysiological processes underpinning simple single-attribute decisions. Over the last several decades, there have been increasingly successful and practical applications of accumulator models to multi-attribute preference data, including decision field theory (Busemeyer & Townsend, 1992, 1993; Roe et al., 2001), the leaky competing accumulator model (Usher & McClelland, 2004), the Poisson race model (Otter et al., 2008; Ruan et al., 2008), and most recently the 2N-ary choice tree model (Wollschläger & Diederich, 2012). Our proposal builds on these developments and applies them to a more complex choice task, best–worst scaling, with the aim of capturing the best of both worlds: a model that has both tractable statistical properties and a cognitive interpretation. This is an example of "cognitive psychometrics," which aims to combine psychological insights from process-based modeling with the statistical advantages of measurement approaches (see e.g., Batchelder, 2009; van der Maas, Molenaar, Maris, Kievit, & Borsboom, 2011).

In the first section of this article, we demonstrated how a simplified accumulator model (LBA: Brown & Heathcote, 2008) can make race models practical for the analysis of the complex, multi-attribute best–worst decisions typically required in many applications. The three MNL-inspired LBA variants we examined were all capable of fitting choice data at least as well as the standard random utility models of choice. The parameter estimates from the race models were closely related to the parameter estimates from the random utility models, providing further confidence in the use of the race models to describe those data.

In the second section of this article, we demonstrated the benefit that response times add to understanding the cognitive processes underpinning best–worst choice. The response time data obtained using an unconstrained best–worst response procedure in a simple perceptual experiment provided evidence against the assumptions of the three LBA models proposed in the first section. We then modified one of the LBA models to develop a new model with parallel best and worst races. The parallel model provided a good account of data at the individual participant and aggregate levels, for both choices and response times, including a range of newly identified phenomena relating to the effect of decision difficulty on the order and speed of best and worst responses.

Further work remains to determine whether the parallel model can account for the very long response times produced by complex multi-alternative choices, and whether a single model can accommodate best-only and best–worst choices with these types of stimuli. For both simple and complex choices, there are challenges remaining related to the fine-grained measurement and modeling of the time to make both best and worst choices. Methodologically, it would be desirable to use a faster response method than moving a pointer with a mouse and clicking a target to minimize the motor component of interresponse time. Recent approaches using eye movements appear promising in this regard (Franco-Watkins & Johnson, 2011a, b). A complementary approach would be to elaborate the parallel model to accommodate the time course of motor processes and to address paradigms like ours that by design do not allow the same best and worst response.

Despite the potential challenges of recording the latencies involved in choices about complex multi-attribute stimuli, we argue that it is worthwhile. One might wonder whether the empirical equivalence we have demonstrated between MNL- and LBA-based analyses calls into question the benefit of response times over and above choice data. There are at least two valid responses to this question. The first is that, from a practical point of view that focuses only on measurement, response times might not always be useful, and LBA models may not provide greater insights than traditional MNL models. Even with this view, the LBA models we have developed are important, because they provide a process-level instantiation for the MNL models—a process by which the utility estimates of the MNL models and the associated predictions might arise in a cognitively and neurophysiologically plausible manner.

The second response is to consider that there are, in fact, situations in which response time data provide extra information that is important and would not be available from regular choice data. This is particularly common when the question of interest is about the nature of the underlying cognitions—for example, when the response time data from our perceptual experiment ruled out some LBA-based models. Even when the question of interest is about measurement only, response times can still be important. To illustrate, consider a hypothetical example based on emerging research in Australia that elicits attitudes toward health states. Suppose that a respondent's level of agreement with various attitudes toward abortion is elicited using best–worst scaling. The two attitude statements "abortion is fundamentally wrong" and "abortion is wrong even if the child would have significant cognitive impairment" may have similar value estimates (either both high or both low, depending on the respondent's attitudes). However, the first statement is likely to induce a fast "emotional" response, and the latter a slower "cognitive" analysis. Similar distinctions have been drawn by Kahneman (2011; "thinking fast and slow") and Hsee and Rottenstreich (2004, "valuation by feeling and valuation by calculation"). Perhaps surprisingly, neither reference cites studies involving response time. Accumulator-based models of preference have the capacity to measure these differences, via response times, whereas traditional random utility models do not.

Response time data are also important when there are no suitable real-world (revealed preference) data with which to validate discrete choice (stated preference) data. This arises frequently in health care, where markets are limited or non-existent and observed behavior is not a reliable guide to policy. Inability to validate the choice data with real-world choices pushes the researcher into a second best world in which triangulation of discrete choice results—obtaining similar value estimates from different conceptual models, as in this article—may be the only option. In such cases, the possibility of different response time distributions to options with similar values (e.g., quality of life for health states) provides added value to the analyses.

The main advantage of our LBA approach over earlier cognitive process models of multi-attribute choices and response times is its mathematical tractability. However, a disadvantage when compared with more complete models, such as decision field theory and the leaky competing accumulator model, is that the LBA models we have developed do not describe various context effects that occur in preferential choice. This happens because the LBA models belong to the class of "horse race" random utility models (Marley & Colonius, 1992), which are known to fail at explaining many context effects (see Rieskamp, Busemeyer, & Mellers, 2006). Nevertheless, Trueblood, Brown, and Heathcote (2013) have recently extended the LBA approach to include context effects in best choice in a way that retains its computational advantages. Parallel extensions are easily envisaged for best–worst choice.

We conclude by proposing that—even though further development is desirable—the parallel best–worst LBA model as it stands is a viable candidate to replace traditional random utility analysis of data obtained by best–worst scaling. The parallel model provides an account of choice data equivalent to the MNL choice models, but in addition it accounts for various patterns in response time data and provides a plausible explanation of the latent decision processes involved in best–worst choice. Of course, if a researcher is only interested in choice data, then our results suggest that it is often appropriate to use the simplest MNL-type models; but even then, our LBA-based models provide a modern and well-validated cognitive interpretation for the MNL models. Further research is required to extend the LBA framework to include drift rate structures analogous to the complex dependent utility structures that appear in models of the generalized extreme value (GEV) class (McFadden, 1978; Train, 2009), which fit data that the simplest MNL models cannot; the GEV class includes the simplest MNL models as a special case. Once such an extended LBA is developed, it will be possible and interesting to check whether the choice probabilities it generates match those of the corresponding GEV choice model.

Acknowledgments

This study was supported by Natural Sciences and Engineering Research Council Discovery Grant 8124-98 to the University of Victoria for Marley. The funding source had no role in study design, or in the collection, analysis, and interpretation of data. The work was carried out, in part, while Marley was a Distinguished Professor in the Centre for the Study of Choice, University of Technology, Sydney.

Notes

1. The theoretical properties of MNL representations for best and/or worst choice were developed in Marley and Louviere (2005); Marley, Flynn, and Louviere (2008); and Marley and Pihlens (2012). Flynn and Marley (2013) summarize those results.

2. Marley (1989a, b) and Marley and Colonius (1992) present response time models that predict choice probabilities that satisfy the MNL (Luce) model. However, those models relate the choices made and the time to make them in a way that often does not hold in data.

3. The normal truncation requires the density and cumulative density expressions for individual accumulators in Brown and Heathcote (2008; Equations 1 and 2) to be divided by the area of the truncated normal, Φ(d/s).

4. The parallel property of the scale u in the usual exponential form of MNL models (see the representation of the maxdiff model in Appendix A) is that an arbitrary constant can be added to its value without altering the predicted choice probabilities.

5. Flynn et al. (2008) also performed a paired model conditional logit regression (without covariates) and a marginal model conditional logit analysis. The regression coefficients did not differ much across these analyses, and so we ignore those, for brevity.

6. Marley and Pihlens (2012) also analyzed their data using other variants of the MNL framework, with little difference to the estimated coefficients. Again, for brevity, we show just the main analyses.

7. These scores are normalized differences between the number of "best" responses and "worst" responses elicited in response to each attribute level.

8. Data from three participants were removed from this analysis due to incomplete data in all cells.

Appendix A

The maxdiff model for best–worst choice

Using the generic notation of Fig. 2, BW_X(xy) is the probability of choosing x as best and y ≠ x as worst when the available set of options is X. The maxdiff model for best–worst choice assumes that the utility of a choice option in the selection of a best option (u) is the negative of the utility of that option in the selection of a worst option (−u) and that

\[
BW_X(xy) = \frac{e^{u(x) - u(y)}}{\sum_{p, q \in X,\; p \neq q} e^{u(p) - u(q)}} \tag{1}
\]

For the Flynn et al. (2008) data and analyses, X is a set of attribute levels, and so x and y are the attribute levels selected as best and worst, respectively. Marley et al. (2008) present mathematical conditions under which all the attribute levels are measured on a common difference scale; in this case, the utility of one attribute level can be set to zero.

For the Marley and Pihlens (2012) best–worst data and analyses, X is a set of multi-attribute profiles, and so x and y are the profiles selected as best and worst, respectively. Marley and Pihlens also assume that each profile has an additive representation over the attribute levels; that is, assuming that each profile has m attributes, then there are utility scales u_i, i = 1, …, m, such that if z = (z_1, …, z_m), with z_i the attribute level for z on attribute i, then

\[
u(z) = \sum_{i=1}^{m} u_i(z_i).
\]

Marley and Pihlens (2012) present a set of mathematical conditions under which the utility of a profile is such a sum of the utilities of its attribute levels, and each attribute is measured on a separate difference scale; in this case, one level on each attribute can have its utility set to zero.
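For readers who want to compute the best–worst probabilities in Equation (1) directly, the following is a minimal R sketch (our code, not part of the original analyses); u is assumed to be a named vector of utilities for the options in X.

maxdiff_bw <- function(u) {
  # All ordered (best, worst) pairs with best != worst
  pairs <- expand.grid(best = seq_along(u), worst = seq_along(u))
  pairs <- pairs[pairs$best != pairs$worst, ]
  num <- exp(u[pairs$best] - u[pairs$worst])
  pairs$prob <- num / sum(num)      # Equation (1): normalized over all ordered pairs
  pairs
}

# Example: three options with utilities 1, 0, and -1
maxdiff_bw(c(A = 1, B = 0, C = -1))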

Appendix B

Parallel best–worst LBA model fits to individual participant response time distributions

In this Appendix, we show the fit of the parallel race model to response time distributions at the level of individual participant data, for all 24 participants. Each panel of Fig. 11 shows the fit of the race model to a separate participant. As described in the main text, there are clear individual differences in the pattern of responding—best-first, worst-first, or no preference for either best- or worst-first—as well as the time course of decisions—some participants made much faster responses than others. For both patterns of individual differences, the parallel model provides a good account of the latency distributions from the majority of participants.
