Journal of the Royal Statistical Society: Series C (Applied Statistics)

The item count method for sensitive survey questions: modelling criminal behaviour



The item count method is a way of asking sensitive survey questions which protects the anonymity of the respondents by randomization before the interview. It can be used to estimate the probability of sensitive behaviour and to model how it depends on explanatory variables. We analyse item count survey data on the illegal behaviour of buying stolen goods. The analysis of an item count question is best formulated as an instance of modelling incomplete categorical data. We propose an efficient implementation of the estimation which also provides explicit variance estimates for the parameters. We then suggest specifications for the model for the control items, which is an auxiliary but unavoidable part of the analysis of item count data. These considerations and the results of our analysis of criminal behaviour highlight the fact that careful design of the questions is crucial for the success of the item count method.

1. Introduction

Asking sensitive questions about behaviour and attitudes is one of the most difficult challenges in survey measurement. In this paper we consider a question on illegal behaviour, but other sensitive areas include sexual activity, use of illicit drugs and embarrassing or socially undesirable opinions and prejudices. It is easily conceivable that many respondents may not give truthful answers to direct questions on such topics.

Measurement error in answers to sensitive questions may be reduced by some choices in the survey design, such as open-ended questions, asking about behaviour over long reference periods, tolerantly loaded introductions and self-administration of the sensitive questions (see Tourangeau and Yan (2007) and Groves et al. (2009) for overviews). Another common approach is the randomized response method, in which respondents employ a randomizing device to add probabilistic misclassification to their responses and thus conceal their true answers from the interviewer. The original randomized response method was proposed by Warner (1965), and other variants have been developed since (see Chaudhuri and Mukerjee (1988), Lensvelt-Mulders et al. (2005) and Tourangeau and Yan (2007) for overviews).

Another way of protecting the respondents' anonymity is the item count method or list experiment (Miller (1984); Raghavarao and Federer (1979) proposed a closely related approach), which has become increasingly popular recently (see Blair and Imai (2012) for a list of some applications). Its basic idea can be introduced with the question shown in Table 1, which will be considered in our application. Each respondent is presented with some or all of a list of questions with possible answers of yes or no. One of these is the sensitive item which is the focus of interest; in our case this is item 6, which asks whether the respondent has bought stolen goods in the past 12 months. All the other questions are control items which are not of direct interest and not meant to be sensitive. The survey respondents are randomly assigned to either the control group, whose list includes only the control items, or the treatment group, who receive both the control items and the sensitive item. In both groups a respondent is asked to report only their total number of yes answers but not the replies to the individual items. Table 2 shows the observed frequencies of these total counts in a sample in our application.

Table 1. The item count question on buying stolen goods, as included in the Euro-Justis survey
‘I am now going to read you a list of five [six] things that people may do or that may happen to them. Please
listen to them and then tell me how many of them you have done or have happened to you in the last 12 months.
Do not tell me which ones are and are not true for you. Just tell me how many you have done at least once.'
[Items included in both the control and treatment groups]
1. Attended a religious service, except for a special occasion like a wedding or funeral.
2. Went to a sporting event.
3. Attended an opera.
4. Visited a country outside [your country].
5. Had personal belongings such as money or a mobile phone stolen from you or from your house.
[Item included in the treatment group only]
6. Bought something you thought might have been stolen.

The intention of the item count method is that respondents in the treatment group should feel able to include a truthful answer to the sensitive item in their response because they would realize that it will be hidden from the interviewer when only the total count is reported. Compared with the classical randomized response method, this has the advantage of avoiding the potentially distracting act of randomization by the respondents themselves during the interview. Potential disadvantages of the item count method are that only the treatment group provides information about the question of interest, and that the inclusion of the control items complicates the survey design and adds uncertainty to the estimation.

Table 2. Numbers of respondents with different reported totals for the item count question in the Euro-Justis survey
Group Item count Total

Quantities of interest can be estimated from randomized response data because the randomization mechanism is known. We may be interested in both estimates of the unconditional probability of the sensitive behaviour and regression-type questions about its associations with explanatory variables. For the item count method, a moment-based (mean difference) estimator has most often been used for the unconditional probability, and straightforward extensions of it for regression modelling (Chaudhuri and Christofides, 2007; Tsuchiya et al., 2007; Holbrook and Krosnick, 2010; Glynn, 2013; Coutts and Jann, 2011). These approaches, however, may be inefficient, and the regression methods in particular are not ideally suited for modelling a count response.

The key to a more efficient and coherent analysis of any randomized response item is to treat it as a problem of incomplete categorical data. This idea has been applied to the modelling of classical randomized response designs by, for example, Maddala (1983), Scheers and Dayton (1988), Chen (1989) and van den Hout and van der Heijden (2004). For item count data, it was first fully recognized by Imai (2011), whose work represents a major advance in the methodology of the item count technique (see also Blair and Imai (2012); Corstange (2009) has also employed models for categorical data, for a related design where in the control group each item is asked individually).

In this paper we consider the modelling of item counts as categorical data. We expand on the results of Imai (2011) in two main ways. First, we propose a faster implementation of the estimation and an explicit estimate of the variance matrix of the estimated parameters. Second, we propose a more flexible set of choices for the model for the control items, which we argue can have a major effect on conclusions about the model of interest for the sensitive item. These points are developed in Section 'Modelling item count data'. An application on modelling criminal behaviour is introduced in Section 'Using survey data to model criminal behaviour' and analysed in Section 'Item count estimates of criminal behaviour'. In Section 'Considerations on the design of item count questions' we discuss the design of item count questions, and in Section 'Conclusions' we offer some conclusions.

2. Using survey data to model criminal behaviour

Our substantive application concerns criminal acts, specifically buying stolen goods. A prominent problem in criminology is understanding what shapes deviant and illegal behaviour: why people do or do not commit crimes reveals what motivates them and how institutions can influence behaviour. To explain variation in such activity, we need a relatively precise estimate of it. Yet criminal behaviour is a classic sensitive issue, which people tend to want to conceal in survey situations in order to create a good impression.

In the criminological literature, the different motivations can be grouped into instrumental and normative ones. Instrumental motivations are guided by rational choice, where people make implicit or explicit calculations of the risks of action against the benefits; consonant modes of crime control policy focus on deterrence and punishment. Thus far, however, the evidence is mixed: it suggests that most people are not driven by calculations of the risks and benefits of committing crime. Accordingly, criminologists have started to move towards trying to understand when, and for which crimes, deterrence might be an influence.

Normative motivations relate to the idea that reasons of morality explain why most people, most of the time, do not commit crimes. Influenced by socialization, family, friends and so forth, people do not act in ways that they believe to be wrong. Social pressure and disapproval plausibly reinforce one's motivation to act according to guiding moral principles. The important thing then is that people think that it is wrong to buy stolen goods, for example, and think that it is morally correct to obey the law simply because it is the law (Tyler, 2006a, b; Jackson et al., 2012). Criminological research on these issues continues apace, no doubt with many important insights still to come.

An especially interesting area of on-going research draws instrumental and normative motivations into a single model of motivation. For example, Kroneberg et al. (2010) reasoned that only a certain proportion of individuals make cost–benefit calculations when considering whether to commit a particular act: people who believe that the act is morally wrong do not regard it as an option at all (see also Wikström et al. (2012)). From an analysis of a survey of Germans, on the specific crime of shoplifting, their findings were consistent with the idea that ‘ … respondents with strongly internalized norms disregard instrumental incentives to shoplift’. In other words, if people believe that it is morally wrong to shoplift, then risk–benefit calculations will not predict intentions to commit the crime, but if they think that it is not morally wrong to shoplift then these calculations start to come into play.

We examine similar questions by using data from a survey that was carried out in Italy, Bulgaria and Lithuania in October–November 2010. The surveys were conducted separately by different organizations in the three countries but co-ordinated by the principal investigators in the UK. The main purpose of the survey was to measure the legitimacy of the criminal justice systems in these three European countries, as part of a broader project into trust in justice known as Euro-Justis (Hough and Sato, 2011). One of the key outcomes of the Euro-Justis project was the inclusion of a module of questions in round 5 of the European Social Survey in 2010 (European Social Survey, 2011); however, this did not include the item count question that is considered here. We use the pooled sample of 2549 respondents (1007 for Bulgaria, 521 for Italy and 1021 for Lithuania). The surveys are not treated as probability samples from the national populations. The main aim of our analysis is regression modelling of illegal behaviour rather than estimation of population proportions.

We consider questions that were motivated by Kroneberg et al. (2010), using a subset of their concepts. The crime that we consider is buying stolen goods. An item count question on it was included in the Euro-Justis survey, worded as shown in Table 1 (and translated into the national languages). The frequencies of responses to it in the pooled sample are shown in Table 2.

We consider two explanatory variables of primary interest. The first is the assessment of the morality of the crime, measured by a survey question whose core part was worded as ‘ … please tell me how wrong it is to  … buy something you thought might be stolen’. The response options were ‘Not wrong at all’ (coded in our analysis as 1), ‘A bit wrong’ (coded as 2/3), ‘Wrong’ (1/3) and ‘Seriously wrong’ (0). The second variable that we focus on is personal financial circumstances, measured by the question ‘Which of [these descriptions] comes closest to how you feel about your household's income nowadays?’, with responses ‘Living comfortably …’ (0), ‘Coping …’ (1/3), ‘Finding it difficult …’ (2/3) and ‘Finding it very difficult on present income’ (1). We treat this as a measure of financial need but also as a rough proxy for perceived benefits of financial crime such as buying stolen goods—although acknowledging the obvious limitations of the latter treatment. ‘Don't know’ responses are coded as missing. In the survey, the question on morality was asked much earlier than the item count question, which came just before the question on financial need.
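For concreteness, the codings just described could be applied in a data-preparation step such as the following sketch. The dictionaries and function are our own illustration, not part of the survey instrument, and the response labels are abbreviated versions of the full option wordings.

```python
# Numeric codings of the two explanatory variables described above.
# "Don't know" and any other unlisted response are mapped to None
# (treated as missing in the analysis).
MORALITY = {"Not wrong at all": 1.0, "A bit wrong": 2 / 3,
            "Wrong": 1 / 3, "Seriously wrong": 0.0}
FINANCES = {"Living comfortably": 0.0, "Coping": 1 / 3,
            "Finding it difficult": 2 / 3,
            "Finding it very difficult": 1.0}

def code(response, mapping):
    """Return the numeric code for a response label, or None if missing."""
    return mapping.get(response)
```

Both variables are thereby scaled to the unit interval, so that their regression coefficients are comparable in magnitude.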

Our substantive hypotheses are that lower financial need and a view that buying stolen goods is morally wrong will each be associated with a lower probability of having committed the crime. A further statistical hypothesis is that there is a positive interaction between the two explanatory variables. This corresponds to the substantive hypothesis, in the spirit of Kroneberg et al. (2010), that strongly internalized norms lead people to disregard instrumental incentives to commit the crime.

The methodological questions that we consider next are how the item count data should best be analysed to answer the substantive questions, and what aspects of the quality of the data and the survey design affect the chances of success of this analysis.

3. Modelling item count data

3.1. Estimation

Consider an item count survey question which includes J control items and one sensitive item. Suppose that we have data on n respondents, of whom n_0 have been randomly assigned to the control group, where the question included only the control items, and n_1 = n − n_0 to the treatment group, where the sensitive item was also included. Let T_i, i = 1, …, n, be a treatment indicator such that T_i = 1 for respondents i in the treatment group and T_i = 0 in the control group.

Let Y_i denote a respondent's answer to the item count question, with possible values 0, 1, …, J + 1, and s_i its observed value in the sample. Write Y_i = Z_i + T_i D_i, where Z_i is the total for the control items, and D_i = 1 if the answer to the sensitive item is yes and D_i = 0 if it is no. For the control group Y_i = Z_i, but for the treatment group Y_i = Z_i + D_i, and Z_i and D_i are not observed separately (the value of D_i for the control group is hypothetical and not used). Define T = (T_1, …, T_n)′ and vectors s and Y similarly, and let X = (x_1, …, x_n)′, where x_i is a vector of explanatory variables for respondent i, including a constant 1.

We assume that the Y_i for different respondents are independent given X. The model for Y_i is specified through two models: Pr(D_i = d | x_i; β) for the sensitive item and Pr(Z_i = z | D_i, x_i; ψ) for the total of the control items, for d = 0, 1 and z = 0, …, J. Here the models involve x_{Di} and x_{Zi}, two subsets of the variables in x_i, which need not be identical, and β and ψ are distinct parameter vectors. The substantive interest is in the model for D_i, for which we use the binary logistic model

log[Pr(D_i = 1 | x_i; β) / {1 − Pr(D_i = 1 | x_i; β)}] = x_{Di}′β.   (1)

When x_{Di} includes only the constant, the only unknown parameter in this is the intercept β_0 = log{π/(1 − π)}, where π = Pr(D_i = 1) is the unconditional probability of a positive response to the sensitive item. Different specifications for Pr(Z_i = z | D_i, x_i; ψ) are discussed in Section 'Specification of a model for the control items'. Let θ = (β′, ψ′)′.

We make the following assumptions.

  1. The D_i included in Y_i in the treatment group is a truthful answer to the sensitive question, so Pr(D_i = 1 | x_i; β) is substantively interesting. In contrast, Z_i need not be the true total for the control items, as long as the assumptions below are satisfied.
  2. (Z_i, D_i) is independent of T_i given x_i. This is satisfied by randomization, which makes the joint distribution of (Z_i, D_i) independent of T_i for any x_i.
  3. Pr(Z_i = z | D_i, x_i, T_i = 1) = Pr(Z_i = z | D_i, x_i, T_i = 0), i.e. that conditional on D_i and x_i the total reported for the control items is not affected by whether or not the sensitive item was included in the question. This assumption is not related to the randomization, so its plausibility must be considered separately.
  4. If the n respondents exclude any non-respondents who did not answer the item count question, the probability of non-response is independent of (Z_i, D_i) given T_i and x_i.

Under these assumptions, the model for the reported total Y_i in the observed data is

Pr(Y_i = s | T_i, x_i; θ) = Σ_{d=0,1} Pr(D_i = d | x_i; β) Pr(Z_i = s − T_i d | D_i = d, x_i; ψ),   (2)

where we take Pr(Z_i = −1 | D_i, x_i; ψ) = Pr(Z_i = J + 1 | D_i, x_i; ψ) = 0; these correspond to the two impossible values (0, J + 1) and (1, 0) of (D_i, Y_i) in the treatment group.
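To make formula (2) concrete, the following is a minimal sketch for the case with no explanatory variables; the function name and arguments are ours, not the paper's.

```python
def observed_prob(s, t, pi, p_z):
    """Equation (2) without covariates: Pr(reported count = s | group t).

    pi     -- Pr(D = 1), probability of 'yes' to the sensitive item
    p_z[d] -- list of Pr(Z = z | D = d) for z = 0, ..., J
    t      -- 1 for the treatment group, 0 for the control group
    Values of Z outside 0..J (that is, -1 and J + 1) get probability 0.
    """
    total = 0.0
    for d in (0, 1):
        z = s - t * d                      # Z implied by Y = Z + T*D
        p_d = pi if d == 1 else 1.0 - pi
        if 0 <= z < len(p_z[d]):           # impossible Z values contribute 0
            total += p_d * p_z[d][z]
    return total
```

With J = 2 and Z independent of D, e.g. p_z = [[0.5, 0.3, 0.2]] * 2, the probabilities sum to 1 over s = 0, …, J + 1 in the treatment group and over s = 0, …, J in the control group.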

When there are no explanatory variables, a simple moment-based estimator of π is

π̂ = s̄_1 − s̄_0,   (3)

where s̄_1 and s̄_0 are the sample means of the reported counts in the treatment and control groups respectively. It follows from equation (2) that π̂ is an unbiased estimator of π. It is also the maximum likelihood estimator under certain conditions, as discussed in Section 'Specification of a model for the control items'. However, it lacks the flexibility to accommodate less than saturated models for D_i and Z_i. It also does not provide a generally convenient extension to models with explanatory variables. The most obvious generalization of equation (3) in this direction is the linear model E(Y_i | T_i, x_i) = x_i′γ + T_i x_i′β (Holbrook and Krosnick, 2010). This yields a consistent estimator of β for a linear probability model for D_i under certain assumptions about the model for Z_i. However, a linear model for D_i is generally unappealing, and the approach again fails to provide flexibility for modelling Z_i.
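The estimator (3), together with the conventional standard error for a difference of two independent sample means, can be computed as follows; this is our own pure-Python sketch, with our own names.

```python
def mean_difference(counts_treat, counts_ctrl):
    """Mean-difference estimator (3) of pi, with a simple standard error
    based on the two within-group sample variances."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):                            # unbiased sample variance
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    pi_hat = mean(counts_treat) - mean(counts_ctrl)
    se = (var(counts_treat) / len(counts_treat)
          + var(counts_ctrl) / len(counts_ctrl)) ** 0.5
    return pi_hat, se
```

The standard error illustrates the design cost noted later in the paper: the variance of the control totals enters the precision of π̂ in both groups.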

A more satisfactory framework for the analysis of item count data is to treat it as the categorical data problem that it clearly is. For a convenient description, we first introduce the device of a set of pseudodata which consists of two stacked copies of the observed data set. The pseudodata thus have m = 2n observations where, for each i = 1, …, n, observations i and i + n have the same values of the observed variables. We then add the pseudovariables D̃_i and Z̃_i, with the values D̃_i = 1 for i = 1, …, n and D̃_i = 0 for i = n + 1, …, m. We denote D̃ = (D̃_1, …, D̃_m)′ and Z̃ = (Z̃_1, …, Z̃_m)′, where by definition Z̃_i = s_i − T_i D̃_i; that is, Z̃_i is s_i − T_i for i = 1, …, n and s_i for i = n + 1, …, m.

The observed data log-likelihood is

l(θ; s) = Σ_{i=1}^n log Pr(Y_i = s_i | T_i, x_i; θ),

with Pr(Y_i = s_i | T_i, x_i; θ) given by equation (2), or, in terms of the pseudodata,

l(θ; s) = Σ_{i=1}^n log{φ_i(θ) + φ_{i+n}(θ)},

where φ_i(θ) = Pr(D_i = D̃_i | x_i; β) Pr(Z_i = Z̃_i | D_i = D̃_i, x_i; ψ) for i = 1, …, m.

One convenient way to maximize it is with the EM algorithm (Dempster et al., 1977). This is based on viewing the values of D_i as missing data. If they, and thus also all Z_i, were observed, the complete-data log-likelihood would be

l*(θ) = Σ_{i=1}^n {log Pr(D_i | x_i; β) + log Pr(Z_i | D_i, x_i; ψ)}.   (4)

To express this in terms of the pseudodata, note that in each pair of pseudo-observations exactly one has D̃_i equal to the true D_i. Then

l*(θ) = Σ_{i=1}^m w_i {log Pr(D̃_i | x_i; β) + log Pr(Z̃_i | D̃_i, x_i; ψ)},   (5)

where w_i = 1 if D̃_i = D_i and w_i = 0 otherwise. This means that, if D_i were observed, we would know which of the two paired rows of the pseudodata corresponded to the true value of (D_i, Z_i). As we do not know this, the EM algorithm gives weight to both possibilities. The algorithm proceeds by alternating between two steps.

  1. E-step: calculate the conditional expected value of the complete-data log-likelihood (5) given the observed data and the current estimate θ_(t) of θ, as
    Q(θ; θ_(t)) = Σ_{i=1}^m w̃_{i(t)} {log Pr(D̃_i | x_i; β) + log Pr(Z̃_i | D̃_i, x_i; ψ)},   (6)
    where
    w̃_{i(t)} = Pr(D_i = D̃_i | s_i, T_i, x_i; θ_(t)) = φ_i(θ_(t)) / {φ_i(θ_(t)) + φ_{i+n}(θ_(t))}   (7)
    for i = 1, …, n, and w̃_{i(t)} = 1 − w̃_{(i−n)(t)} for i = n + 1, …, m. The subscripts (t) indicate that all the probabilities in w̃_{i(t)} are calculated with the parameter values θ_(t).
  2. M-step: maximize Q(θ; θ_(t)) to obtain updated estimates θ_(t+1). The two terms Σ_i w̃_{i(t)} log Pr(D̃_i | x_i; β) and Σ_i w̃_{i(t)} log Pr(Z̃_i | D̃_i, x_i; ψ) can be maximized separately. These are weighted log-likelihoods for the complete-data models for D_i and Z_i, with weights w̃_{i(t)}. The updated estimates can thus be obtained by fitting the models to the pseudodata by using standard software, as long as these allow the fractional frequency weights w̃_{i(t)}.

A maximum likelihood estimate θ̂ of θ is obtained by iterating the algorithm to convergence, starting with some initial values θ_(0). This is the estimation method that was proposed by Imai (2011) and implemented in the R software (R Core Team, 2012) in the package list (Blair and Imai, 2010).
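As a concrete sketch of the two steps above, the following implements the EM iteration for the simplest special case: no explanatory variables, and Z_i assumed independent of D_i with an unrestricted (multinomial) distribution. This is our own illustration under those simplifying assumptions, not the paper's implementation or the R package list; all names and example data are ours.

```python
def em_item_count(s_treat, s_ctrl, J, n_iter=2000):
    """EM estimation of pi = Pr(D = 1) and q[z] = Pr(Z = z), z = 0..J,
    assuming Z independent of D and no explanatory variables."""
    n = len(s_treat) + len(s_ctrl)
    pi = 0.5                                   # starting values
    q = [1.0 / (J + 1)] * (J + 1)
    for _ in range(n_iter):
        # E-step: posterior weight w = Pr(D = 1 | reported count).
        d_exp = 0.0                            # expected number with D = 1
        z_exp = [0.0] * (J + 1)                # expected counts of Z = z
        for s in s_treat:                      # treatment group: Y = Z + D
            phi1 = pi * q[s - 1] if s >= 1 else 0.0
            phi0 = (1.0 - pi) * q[s] if s <= J else 0.0
            w = phi1 / (phi1 + phi0)
            d_exp += w
            if s >= 1:
                z_exp[s - 1] += w              # Z = s - 1 if D = 1
            if s <= J:
                z_exp[s] += 1.0 - w            # Z = s if D = 0
        for s in s_ctrl:                       # control group: Y = Z
            z_exp[s] += 1.0
            d_exp += pi                        # posterior for D equals prior
        # M-step: weighted proportions give the updated parameters.
        pi = d_exp / n
        q = [c / n for c in z_exp]
    return pi, q
```

Because both models are saturated apart from the independence restriction, the M-step here reduces to taking weighted proportions; with covariates it would instead be a weighted logistic and a weighted multinomial fit.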

Two well-known disadvantages of the EM algorithm are that it can be slow to converge and that it does not automatically provide an estimate of the variance matrix of θ̂. Here the difference in speed from our Newton–Raphson algorithm described below is enhanced by the fact that the EM implementation requires two iterative procedures at each M-step, whereas each Newton–Raphson update involves only a single non-iterative step. In our application the difference is mostly one of convenience rather than of real practical significance: a typical model might take less than 1 s and 15 iterations with Newton–Raphson iteration, and 12 s and 150 (E-step) iterations with EM. For the variance matrix, Imai (2011) used numerical differentiation of l(θ; s), as implemented in the optim function in R, to approximate the observed data information matrix. We propose to replace this with a closed form expression.

We make use of an elegant but relatively little-used result which is implicit in some earlier literature on the EM algorithm but which was first stated explicitly by Oakes (1999). This shows that the function Q(θ; θ̃) derived at the E-step of the EM algorithm can also be used to calculate both the observed data score function, as

U(θ) = ∂l(θ; s)/∂θ = [∂Q(θ; θ̃)/∂θ]_{θ̃=θ},   (8)

and the observed data information matrix, as

I(θ) = −∂²l(θ; s)/∂θ ∂θ′ = −[∂²Q(θ; θ̃)/∂θ ∂θ′ + ∂²Q(θ; θ̃)/∂θ ∂θ̃′]_{θ̃=θ},   (9)

both of which hold for any value of θ. In these expressions it is crucial that θ and θ̃ are treated as distinct quantities when the derivatives are taken, with θ̃ set equal to θ only afterwards.

The results (8) and (9) allow us, first, to speed up the convergence of the estimation substantially by replacing the M-step of the EM algorithm with a Newton–Raphson update step

θ_(t+1) = θ_(t) + I(θ_(t))^{−1} U(θ_(t)).

Second, an estimated variance matrix of the estimates θ̂ is given by equation (9) as I(θ̂)^{−1}. Because the Newton–Raphson algorithm may also diverge, to achieve convergence it is important to use sensibly chosen starting values and/or to shorten the update step for some iterations if needed.
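The update step and the step-shortening safeguard can be sketched in the following toy version, in which, for transparency, only π is estimated and the control-total distribution q is treated as known, so that the score and observed information are scalars. This is our own illustration under those simplifying assumptions, not the paper's algorithm for the full parameter vector θ.

```python
import math

def newton_raphson_pi(s_treat, q, J, pi=0.5, n_iter=25):
    """Newton-Raphson for pi = Pr(D = 1) from treatment-group counts,
    with q[z] = Pr(Z = z), z = 0..J, treated as known."""
    def loglik(p):
        total = 0.0
        for s in s_treat:
            a = q[s - 1] if s >= 1 else 0.0    # component for D = 1
            b = q[s] if s <= J else 0.0        # component for D = 0
            total += math.log(p * a + (1.0 - p) * b)
        return total
    for _ in range(n_iter):
        score = info = 0.0
        for s in s_treat:
            a = q[s - 1] if s >= 1 else 0.0
            b = q[s] if s <= J else 0.0
            r = (a - b) / (pi * a + (1.0 - pi) * b)
            score += r                         # first derivative of loglik
            info += r * r                      # minus the second derivative
        step = score / info
        # shorten the step until it stays in (0, 1) and does not
        # decrease the log-likelihood (guard against divergence)
        while not (0.0 < pi + step < 1.0) or loglik(pi + step) < loglik(pi):
            step /= 2.0
        pi += step
    return pi
```

Halving the step whenever the log-likelihood would decrease is one simple form of the step-shortening mentioned above; line searches or switching to a few EM iterations are common alternatives.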

For the model for item count data, Q(θ; θ̃) is given by equation (6), with the weights evaluated at θ̃. Here the terms log Pr(D̃_i | x_i; β) and log Pr(Z̃_i | D̃_i, x_i; ψ) depend only on θ, and the weights w̃_i, defined by equation (7), only on θ̃. For the logistic model (1), denote μ_i = Pr(D_i = 1 | x_i; β), and let

g_i(θ) = ((D̃_i − μ_i) x_{Di}′, ∂ log Pr(Z̃_i | D̃_i, x_i; ψ)/∂ψ′)′

denote the complete-data score contribution of pseudo-observation i, for i = 1, …, m. The score function (8) is then

U(θ) = Σ_{i=1}^m w̃_i g_i(θ),   (10)

with the weights w̃_i evaluated at θ̃ = θ. For the first term of equation (9),

−∂²Q/∂θ ∂θ′ = blockdiag{Σ_{i=1}^m w̃_i μ_i(1 − μ_i) x_{Di}x_{Di}′, −Σ_{i=1}^m w̃_i ∂² log Pr(Z̃_i | D̃_i, x_i; ψ)/∂ψ ∂ψ′},   (11)

and the cross-derivative blocks between β and ψ are zero matrices. The second term of equation (9) is

−∂²Q/∂θ ∂θ̃′ = −Σ_{i=1}^n w̃_i(1 − w̃_i){g_i(θ) − g_{i+n}(θ)}{g_i(θ̃) − g_{i+n}(θ̃)}′,   (12)

which is symmetric when g_i(θ̃) and g_{i+n}(θ̃) are evaluated at θ̃ = θ to become g_i(θ) and g_{i+n}(θ).

These expressions apply when the model for D_i is the logistic model (1). The specific forms of equations (10)-(12) depend on the choice of the model for Z_i. It can be seen that only the first and second derivatives of the logarithms of these probabilities are needed to complete the calculations. Explicit expressions for the four models that we shall consider are given in Appendix A. Finally, the observed data log-likelihood at the maximum likelihood estimates can be calculated from the pseudodata as

l(θ̂; s) = Σ_{i=1}^n log{φ_i(θ̂) + φ_{i+n}(θ̂)},

with the probabilities evaluated at θ̂.

More generally, any randomized response or comparable technique will involve incomplete data of some kind. This means that modelling methods for them can often be conveniently developed along similar lines to those above. References to such work for classical randomized response were given in Section 'Introduction'. For item count questions, this could be done for example for the modified version of Chaudhuri and Christofides (2007) which is designed to avoid the ceiling effect that is discussed in Section 'Considerations on the design of item count questions' (although some external information is then also needed by design). Jann et al. (2012) described such modelling for yet another method for sensitive questions: the crosswise model of Yu et al. (2008).

3.2. Specification of a model for the control items

The formulation of the problem in the previous section makes it clear that any analysis of item count data involves a model Pr(Z_i = z | D_i, x_i; ψ) for the totals Z_i of the control items, whether or not this is explicit in the formulae of the estimators. This model is a distinctive element of the method, which does not arise in classical forms of randomized response. It is a nuisance element which is of no substantive interest in itself. Nevertheless, it still needs to be specified appropriately, lest errors there distort estimates of the model of interest for D_i. In this section we compare possible choices for the model for the control items.

There are two parts to the specification of the model for Z_i: the choice of the distribution itself, and how it may depend on the explanatory variables and on D_i. For the distribution, Imai (2011) and the computer implementation by Blair and Imai (2010) considered the binomial and the beta-binomial distributions. For the explanatory variables they used x_{Di}, i.e. the same variables as in the model for D_i, or these interacted fully with D_i. Here we suggest some generalizations for both of these elements of the model. Details of the specific models that we use are given in Appendix A.

Let B_{ij}, j = 1, …, J, denote respondent i's unobserved answer to control item j, with values 0 for no and 1 for yes, so that Z_i = Σ_{j=1}^J B_{ij}. Suppose that each B_{ij} follows a Bernoulli distribution with probability p_{ij}, and that different items B_{ij} and B_{ik} may be dependent, with covariances σ_{ijk}. Then Z_i has mean J p̄_i and variance

var(Z_i) = J p̄_i(1 − p̄_i) − Σ_{j=1}^J (p_{ij} − p̄_i)² + Σ_{j≠k} σ_{ijk},   (13)

where p̄_i = Σ_{j=1}^J p_{ij}/J. The mean is equal to that of a binomial distribution with index J and probability p̄_i. The first term of equation (13) is the variance of this binomial distribution, whereas the last two terms represent overdispersion or underdispersion relative to this variance. The second term, which is due to heterogeneity of the probabilities p_{ij}, is always negative and thus contributes underdispersion. The third term can be positive or negative, depending on the pattern of dependences between the B_{ij}.

If there is neither heterogeneity nor dependence, the last two terms of equation (13) are both 0 and the distribution of Z_i is binomial. If there is no dependence, the last term is 0 and we obtain the Poissonian binomial distribution in the sense of Johnson et al. (1992), section 3.12.2, which is always underdispersed relative to the binomial distribution. If there is no heterogeneity, the second term is 0, with p_{ij} = p_i for all j. If we then further assume that the covariances σ_{ijk} are all equal, we obtain var(Z_i) = J p_i(1 − p_i){1 + (J − 1)ρ}, where ρ is the common ‘intraclass correlation’ between the B_{ij} for respondent i. The beta-binomial distribution has a variance of this form. Its standard motivation as a mixture distribution implies that ρ is non-negative, but more generally the distribution is also well defined for some negative values of ρ (Prentice, 1986).
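The decomposition (13) can be checked numerically in the no-dependence case, where var(Z_i) is the sum of the Bernoulli variances of the individual items; the item probabilities below are hypothetical, chosen only to illustrate the underdispersion.

```python
# Numerical check of the variance decomposition (13) with independent
# control items (all covariances zero): the Poissonian binomial variance
# equals the binomial term minus the heterogeneity term.
p = [0.1, 0.25, 0.4, 0.6, 0.8]        # hypothetical item probabilities
J = len(p)
p_bar = sum(p) / J

direct = sum(pj * (1 - pj) for pj in p)       # sum of Bernoulli variances
binom = J * p_bar * (1 - p_bar)               # binomial variance term
heterogeneity = sum((pj - p_bar) ** 2 for pj in p)
decomposed = binom - heterogeneity            # equation (13), last term 0
```

Here direct and decomposed agree, and both fall below the binomial variance, illustrating that heterogeneous independent items always produce underdispersion.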

It is, however, undesirable to consider only such special cases of the distribution of Z_i. First, even a general version of the beta-binomial model, say, cannot accommodate a distribution which is strongly underdispersed relative to the binomial distribution. Such a distribution for the control items would in fact be ideal from a design point of view (see the discussion in Section 'Considerations on the design of item count questions'), so we should prefer a distribution which can represent such items if we do manage to create them. Second, in the item count context it may not be enough to model well only the mean and variance of Z_i. Expression (2) of the probabilities for the observed data shows that these involve all the probabilities of individual values of Z_i. An adequate model for all of them is thus needed to disentangle them correctly from the model for D_i.

We suggest that by default the distribution of Z_i should be specified with maximum flexibility, as a multinomial distribution with index 1 and probabilities Pr(Z_i = z) for z = 0, …, J. In the examples below we compare the multinomial with the binomial and beta-binomial models. For the dependence of the multinomial probabilities on explanatory variables we consider two possibilities: the multinomial logistic model, which ignores the ordering of the values of Z_i, and the ordinal logistic (proportional odds) model, which takes the ordering into account. The ordinal model is in principle appealing because it is flexible in the response distribution but relatively parsimonious in the parameterization of the effects of the explanatory variables.

Consider now the choices of explanatory variables for the model for Z_i. We denote these by x_{Zi}. They include at least a constant, but possibly also other variables in x_i, the sensitive response D_i, and even products (interactions) between D_i and some or all of the other variables. Note that x_{Zi} does not need to include the same variables as x_{Di}, and there may be some gain in efficiency if it does not. Throughout, models with nested choices of x_{Zi} may be compared by using likelihood ratio tests.

The most consequential aspect of the model for Z_i is the extent to which it depends on D_i. In particular, the estimates will be most efficient when it does not, i.e. when Z_i and D_i are conditionally independent given the explanatory variables. If this is so, expression (2) shows that Pr(Y_i = s | T_i = 0, x_i; θ) = Pr(Z_i = s | x_i; ψ). The parameters ψ could then even be estimated directly by fitting the model for Z_i to the control group only, and the data in the treatment group will contribute information mostly about the model for D_i. If the conditional independence does not hold, a smaller amount of information is available on both models, and both are mixed up in both groups.

Further insight into this loss of information is provided by the form of the information matrix I(θ) in equation (9). Its second term can be seen as the ‘missing information’ due to the fact that the D_i are not observed. All of its terms depend on the quantities w̃_i(1 − w̃_i), which are the predictive variances of the unknown D_i given s_i, T_i and x_i. For ψ, and through the cross-derivative terms of I(θ) also for β, the missing information also involves contributions of the form ∂ log Pr(Z̃_i | D̃_i = 1, x_i; ψ)/∂ψ − ∂ log Pr(Z̃_i | D̃_i = 0, x_i; ψ)/∂ψ. These describe how different the gradients of log Pr(Z̃_i | D̃_i, x_i; ψ) are at the two possible values 0 and 1 of D̃_i. The magnitude of these differences, and thus the amount of missing information, depends on the specification of the model for Z_i. The key feature of the case where Z_i is conditionally independent of D_i is that then these differences are 0 for all observations in the control group, so they do not add anything to this component of the missing information.

A model of special interest is one which has no explanatory variables, with π = P(Y = 1) and Z multinomially distributed with probabilities λ_{jy} = P(Z = j | Y = y). This uses 2J+1 parameters to model the J+1+J=2J+1 free probabilities P(S = s | T = t) in the table of randomization group t by observed item count s (see Table 2). The model is thus saturated, and the maximum likelihood estimators of P(S = s | T = t) are the observed sample proportions P̂(S = s | T = t). Solving the expressions of these probabilities in expression (2) for π and λ_{jy}, we obtain as estimate of π the mean difference π̂ given by equation (3), and for λ_{jy}

λ̂_{j1} = {P̂(S ≤ j | T = 0) − P̂(S ≤ j | T = 1)} / π̂,   (14)

λ̂_{j0} = {P̂(S = j | T = 1) − π̂ λ̂_{j−1,1}} / (1 − π̂),   (15)

for j=0, …,J, starting with λ̂_{−1,1} = 0 (see also Glynn (2013) (supplementary material), who gave corresponding expressions for the probabilities of Y given Z). These are equal to the maximum likelihood estimates of π and λ_{jy} obtained as in Section 'Estimation', if the λ̂_{jy} are all non-negative (the case where they are not is discussed in the next section). This equivalence demonstrates that any gain in efficiency that is obtained by the maximum likelihood estimators over the mean difference (3) is not due to the formulation of the problem as a model for categorical data, but is realized only if we can assume a more parsimonious model for Z than the multinomial, or assume that Z and Y are independent.
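The mean difference (3) and the recursion (14)–(15) can be computed directly from the two empirical distributions of S. A minimal Python sketch (function and variable names are ours, not from the paper):

```python
def moment_estimates(s_control, s_treat, J):
    """Moment estimators for an item count question with J control items.

    s_control, s_treat: observed totals S in the control (T=0) and
    treatment (T=1) groups. Returns (pi_hat, lam1, lam0), where pi_hat
    is the mean-difference estimate of P(Y=1), and lam1[j], lam0[j]
    estimate P(Z=j | Y=1) and P(Z=j | Y=0) as in equations (14)-(15)."""
    n0, n1 = len(s_control), len(s_treat)
    # Sample proportions P(S = j | T = t); S can reach J+1 in the treatment group
    p0 = [sum(s == j for s in s_control) / n0 for j in range(J + 2)]
    p1 = [sum(s == j for s in s_treat) / n1 for j in range(J + 2)]
    # Mean-difference estimator of pi = P(Y = 1), equation (3)
    pi_hat = sum(s_treat) / n1 - sum(s_control) / n0
    c0 = c1 = 0.0
    joint1, joint0 = [], []
    for j in range(J + 1):
        c0 += p0[j]
        c1 += p1[j]
        # P(Z = j, Y = 1) = P(S <= j | T=0) - P(S <= j | T=1)
        joint1.append(c0 - c1)
        # P(S = j | T=1) = P(Z = j, Y = 0) + P(Z = j-1, Y = 1)
        prev = joint1[j - 1] if j > 0 else 0.0
        joint0.append(p1[j] - prev)
    lam1 = [p / pi_hat for p in joint1]
    lam0 = [p / (1.0 - pi_hat) for p in joint0]
    return pi_hat, lam1, lam0
```

With inconsistent sample proportions, some of the returned probabilities are negative, which is the diagnostic used in Section 4.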

We conducted a limited simulation study of the effect of the specification of the model for Z. Four situations were considered, with 1000 data sets simulated in each. There were no explanatory variables. In each case the sample size was 2400 with 1200 observations in each of the treatment and control groups, Y was drawn from a Bernoulli distribution with probability π = 0.1 and there were five control items. The four cases differ in the model for Z and represent different combinations of its distribution and whether this depends on Y. In cases 1 and 2, Z follows a binomial distribution, in case 1 with probability 0.25 for all respondents, and in case 2 with probability 0.25 when Y = 0 and 0.355 when Y = 1. In cases 3 and 4, Z follows a multinomial distribution; in case 3 with probabilities (0.2, 0.2, 0.2, 0.2, 0.1, 0.1) for Z=0, …,5 for every respondent; in case 4 these probabilities apply when Y = 0 and (0.1, 0.15, 0.15, 0.2, 0.2, 0.2) when Y = 1. The multinomial probabilities are chosen so that they are not well represented by a binomial distribution.
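As an illustration of this design, the following Python sketch (our own code, not the authors'; names are ours) generates one data set under case 2, where Z is binomial with a probability that depends on Y:

```python
import random

def simulate_case2(n_per_group=1200, J=5, pi=0.1, seed=1):
    """Simulate one item count data set as in case 2 of the simulation:
    Y ~ Bernoulli(pi); Z | Y ~ Binomial(J, 0.25 if Y=0 else 0.355).
    Returns the observed totals S for the control and treatment groups."""
    rng = random.Random(seed)

    def draw_zy():
        y = 1 if rng.random() < pi else 0
        p = 0.355 if y else 0.25
        z = sum(rng.random() < p for _ in range(J))
        return z, y

    # Control group observes S = Z; treatment group observes S = Z + Y
    s_control = [draw_zy()[0] for _ in range(n_per_group)]
    s_treat = [z + y for z, y in (draw_zy() for _ in range(n_per_group))]
    return s_control, s_treat
```

The mean difference between the two groups then recovers π ≈ 0.1 up to sampling variation, regardless of the Z–Y dependence, which is why the moment estimator remains unbiased in all four cases.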

Table 3 shows the results of the simulation for estimates of π. We consider maximum likelihood estimators under the four models listed in Appendix A, each under both the assumption that Z and Y are independent and the assumption that they are dependent, thus for a total of seven estimators (the multinomial logistic and ordinal logistic models are equivalent under independence). There were a handful of simulations where some estimators converged to a very small value (less than −10 on the logit scale). These may represent false convergence of the algorithm, so simulations where this happened for any estimator are excluded; there were no more than 19 instances of this in any of the four cases.

Table 3. Results of a simulation study of estimators of the probability of a sensitive item in item count data†
[Rows: the four true models for Z; columns: the estimated models for Z (binomial, beta–binomial, multinomial logistic and ordinal logistic), each fitted with Z and Y modelled as independent and as dependent. The upper panel gives the means of the estimates of π (×1000; true value 100) and the lower panel their root-mean-squared errors (×1000).]
†The table shows means and root-mean-squared errors of estimators over 1000 simulated data sets. See the text for the details of the simulation specifications. Independent, Z and Y are modelled as independent; dependent, Z and Y are modelled as dependent.

The simulation means in the upper part of Table 3 show that all the estimators are approximately unbiased when the model for Z is correctly specified or overparameterized. When this model is incorrectly specified, however, the estimator of π is biased, in many cases dramatically so. This happens when either the distribution of Z or the association between Z and Y is misspecified. It is worth noting that, at least in these cases, the one additional parameter of the beta–binomial model reduces the bias substantially relative to the binomial model when the true model is multinomial.

The main message of the lower part of Table 3 concerns the loss of efficiency when we must assume dependence between Z and Y. This can be seen by comparing the results under dependence models with those under independence models for correctly specified distributions. The loss of efficiency is small when the true distribution is binomial but much larger when it is multinomial. In the latter case the use of an ordinal model for the dependence improves the efficiency slightly but still leaves it far lower than that of the independence model.

4. Considerations on the design of item count questions

Careful design of the survey items is clearly a prerequisite for the success of the item count methodology. Here we consider briefly some elements of design (for more extensive discussions, see for example Glynn (2013), Blair and Imai (2012), and references cited therein). We focus on the technical questions of articulating the assumptions that the items should satisfy, how these affect the model specification and how they may be checked in the analysis. This is informative on, but does not directly answer, the central practical challenge of design, which is how we should choose the items to have a good chance of satisfying the assumptions. This question is not amenable to simple technical analysis or solutions, because the success of the exercise is ultimately dependent on how the survey respondents react to the questions. For understanding and predicting these reactions, the designer of an item count question will benefit from a good knowledge of the general theories and empirical evidence on the psychology of survey response (Tourangeau et al., 2000).

The item count technique or any other randomized response method is also likely to involve psychological peculiarities of its own. These relate to what might be termed the ‘weirdness factor’ of the method, which is created when the interviewer appears to the respondent to deviate from the implicit social contract of what a survey interview should involve. With a classical randomized response question this happens when the respondent is suddenly asked to do something like to spin a dial to choose at random which question they should answer. In an item count question, the weirdness arises from being presented with a list of apparently disparate items with no indication of why the total of them might be of interest to the interviewer. The strangeness of this may be less than that of the dial spinning but it can still be non-negligible. It thus cannot be taken for granted that the respondents will react to an item count list exactly as intended, even if all the individual items are ostensibly simple and easy to answer.

Validity of item count measurement requires that the assumptions (a)–(d) stated at the beginning of Section 3.1 are satisfied. Of them, assumption (b) is satisfied by the randomization unless it is undermined by failure of assumption (d), i.e. differential non-response. The other assumptions cannot be guaranteed through design.

Assumption (a) of no lying is the motivation of the item count technique in the first place, in that it is designed to reduce reasons for lying by guaranteeing anonymity. This protection will fail completely in one situation: the ‘ceiling effect’ of a respondent in the treatment group whose truthful answer to all of the items would be affirmative, in which case a truthful total of J+1 would logically reveal the answer to the sensitive item. Direct evidence of the prevalence of this problem is given by the proportion of counts of J in the control group. In design, the aim should then be to select control items for which few respondents would give only affirmative answers. One way to achieve this is to use items which are individually rare. Another, which also reduces the chances of the ‘floor effect’ discussed below, is to choose a control set where some pairs of items are negatively correlated (Glynn, 2013).

Often discussed alongside the ceiling effect is the floor effect of a respondent in the treatment group whose truthful answers would be negative to all the control items but affirmative to the sensitive item. The argument is that such a person might judge that a truthful count of 1 would be known to correspond to the sensitive item. This, however, follows logically only if the interviewer can reasonably conclude from their observation of that respondent that his or her answer to all the control items must be no, a situation which should not be allowed to arise with a sensible set of control items. In other cases, concern with the floor effect involves a less compelling argument which requires reference to a population of other respondents, i.e. a judgement that the control items are such that most people's answers to all of them are likely to be negative. This does not necessarily follow even if all the items are individually rare.

A violation of the non-response assumption (d) can also lead to violation of assumption (c). Apart from that, assumption (c) is essentially a requirement of compliance: that the respondents actually respond to the question as stated and report a sum of the items, rather than somehow react to the list as a whole (in which case it could matter whether the sensitive item was on it or not). Assuring this at the design stage is a considerable challenge. In the analysis, there is one obvious if partial way of examining the validity of the assumption. This is to use the logical conditions that P(S ≤ s | T = 0) ≥ P(S ≤ s | T = 1) for all s=0, …,J−1 and P(S ≤ s−1 | T = 0) ≤ P(S ≤ s | T = 1) for all s=1, …,J (Blair and Imai, 2012). If these do not hold for all sample proportions, some of the moment estimates (14)–(15) of the probabilities of Z will be negative. This may occur because of sampling variation even when assumption (c) is satisfied, so a significance test for the conditions is needed; such a test was proposed by Blair and Imai (2012). A test result that does not detect a significant violation is not sufficient evidence that the assumption is satisfied, but a significant test result does provide strong evidence that it is not. Furthermore, apart from sensitivity analysis of the possible biases, there is nothing that can really be done in the analysis to adjust for such a violation. The conclusion from this part of the analysis may thus be the disheartening one that an item count question is irretrievably flawed.
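These logical conditions are straightforward to evaluate from the sample proportions. A Python sketch (function names are ours; this checks only the point conditions, not the significance test of Blair and Imai (2012)):

```python
def compliance_conditions_hold(s_control, s_treat, J):
    """Check the logical conditions implied by assumption (c):
    P(S<=s|T=0) >= P(S<=s|T=1) for s = 0,...,J-1 and
    P(S<=s-1|T=0) <= P(S<=s|T=1) for s = 1,...,J,
    evaluated at the sample proportions."""
    n0, n1 = len(s_control), len(s_treat)
    # Empirical cumulative probabilities of the observed totals S
    F0 = [sum(s <= j for s in s_control) / n0 for j in range(J + 1)]
    F1 = [sum(s <= j for s in s_treat) / n1 for j in range(J + 1)]
    cond1 = all(F0[s] >= F1[s] for s in range(J))
    cond2 = all(F0[s - 1] <= F1[s] for s in range(1, J + 1))
    return cond1 and cond2
```

A returned value of False means that some moment estimates (14)–(15) of the probabilities of Z are negative, as happens for the Euro-Justis data discussed below.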

Another element of validity is that of the model specification, especially of the model for the control items as we argued in Section 3.2. Even when correct, this model also affects the efficiency of the analysis, i.e. the extent of missing information that reduces the precision of the estimates of the parameters of interest. This depends on the complexity of the model for the control items, most of all on whether or not they are conditionally independent of the sensitive item. This conditional independence should be one aim of the design of an item count question. It should not be impossible to achieve when the control items are unrelated in content to the sensitive item, but it cannot be guaranteed in advance.

These considerations suggest that the ideal set of control items would be one for which every respondent would report the same count z with 0<z<J, achieved in such a way that the z would not be the same items for everyone (to avoid a version of the floor effect). Such items would both satisfy all the assumptions for validity and maximize the efficiency of the estimates. They are of course unachievable in practice but worth keeping in mind as a general aim.

The design of an item count question is likely to involve trade-offs between different aims. For example, a list of control items which are independent of the sensitive item and negatively correlated with each other may seem particularly odd to the respondent and thus increase the risk of non-compliance. Chaudhuri and Christofides (2007) emphasized this danger and recommended instead choosing sensitive and control items which are thematically related—which will then mean that they are unlikely to be statistically independent.

How then do the Euro-Justis item count questions measure up against these criteria? When the questions were designed, there was a conscious attempt to avoid very common items and some aim to include items which would be weakly or even negatively correlated with each other. To try to reduce the weirdness factor, the list consists of relatively general and not strikingly peculiar inquiries. Furthermore, item 5, which asks whether the respondent has been a victim of crime, was included to try to make the list seem a little less out of place in a survey that was otherwise mostly about crime and criminal justice. This, however, may have had the consequence of introducing an association between the sensitive item and the total of the control items. It is clear from the analysis below that there is indeed such an association, and it may be because victimization and criminal behaviour tend to be correlated.

The survey organizations in the three countries of the Euro-Justis survey each produced a field report where the interviewers summarized their own experiences and common reactions by the respondents. It is encouraging to note that none of the three reports mentioned any concerns about the item count question. This contrasts with comments, which came consistently from all the countries, that many respondents had reacted negatively to direct questions elsewhere in the survey on other sensitive topics such as family income and attitudes towards crime.

Only 5.6% of the respondents in the control group and 4.7% in the treatment group refused to answer the item count question, so non-response is unlikely to be a major source of bias. The potential for ceiling effects is also minimal, as only 21 of 1206 respondents in the control group gave the maximum value of 5. Less reassuring is the finding that two of the cumulative proportions in the observed sample are inconsistent. These are the proportions for counts of 0 and 2, where the cumulative probability is smaller in the control than in the treatment group (0.223 versus 0.230 and 0.828 versus 0.830). These differences are not statistically significantly negative, so they can be due to sampling variation. Nevertheless, they give some reason to worry that assumption (c) of consistency of responses may be violated.

5. Item count estimates of criminal behaviour

Table 4 shows estimates for models without explanatory variables for the item count question in the Euro-Justis survey. The quantity of main interest is the probability π of the sensitive item Y—having bought stolen goods in the past year. Table 4 also includes estimates of the probabilities of different counts Z for the five control items. Here the focus is on how the estimates of π are affected by different choices for the model for the control items. We consider the moment-based estimators given by equations (3), (14) and (15), as well as maximum likelihood estimators with each of the four models for Z that are listed in Appendix A, the latter each both with and without the assumption that Z and Y are independent. For assessment and comparison of model fits, Table 4 includes the Akaike information criterion statistic AIC = −2 log L̂ + 2q, where L̂ is the maximized likelihood and q is the number of estimated parameters in a model, and the p-value for a χ²-test of goodness of fit which compares the fitted counts for reported totals S from each model with the observed counts that are shown in Table 2.

Table 4. Probabilities of buying stolen goods, π, and of different counts for the five control items, estimated from the item count question in the Euro-Justis survey†
[Columns: estimator (model for Z); whether the model for Z is conditional on Y; the estimated probabilities of the counts Z = 0, …, 5; π̂; its standard error; the p-value‡ of the goodness-of-fit test; AIC.]
†The table shows estimates under different model assumptions for the control items.
‡p-value for a χ²-test of goodness of fit compared with the observed counts shown in Table 2.
§With 2 degrees of freedom, allowing post hoc for the two estimated probabilities of 0.


As discussed in Section 'Considerations on the design of item count questions', two of the moment-based estimates of the probabilities of Z are negative; these probabilities have boundary estimates of 0 in the saturated multinomial logistic model where Y and Z are dependent. For each model for Z, the hypothesis of independence between Z and Y is clearly rejected, and there is a substantial difference between the estimated probabilities of Z conditional on the two values of Y. The probabilities given Y=0 are generally similar to those obtained when Z and Y are assumed independent, whereas the estimated probabilities given Y=1 are much less stable.

For these data, the estimated proportion of people who have bought stolen goods is fairly sensitive to the model assumed for the control items. Point estimates of the proportion are 0.12–0.14 when the clearly inappropriate assumption of independence between Z and Y is made, but 0.02–0.09 without it. Different dependence models also produce rather different results. The binomial and beta–binomial models give the higher estimates of 0.06–0.09. The goodness of fit of these two models is inadequate according to the χ²-test even when Z and Y are dependent. Only the multinomial models where Z depends on Y yield a good fit, both with a multinomial logistic and an ordinal logistic model for the dependence. Estimates of π are then 0.041 with the multinomial and 0.017 with the ordinal model.

The differences between these estimates appear somewhat less dramatic when we acknowledge also the uncertainty in them, as revealed by the estimated standard errors for π̂ shown in Table 4. The 95% confidence intervals for π (derived from intervals for logit π) are (0.016–0.103), (0.005–0.058), (0.037–0.108) and (0.064–0.117) for the multinomial, ordinal, beta–binomial and binomial dependence models respectively, so there is substantial overlap between most of them. This would increase further if we tried to allow for misspecification of the models for the control items by using, say, sandwich-type estimators of the standard errors. The same would be true to a smaller extent for results obtained with independence models for the control items, but there the primary conclusion must still be that these models would lead to non-trivially different conclusions about the quantity of interest.
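Intervals of this type are obtained by constructing a Wald interval on the logit scale and transforming its end points back to probabilities; the sketch below illustrates the calculation (our own code; the 1.96 multiplier for a 95% interval and all names are our assumptions):

```python
from math import exp, log

def ci_from_logit(pi_hat, se_logit, z=1.96):
    """95% confidence interval for a probability pi, computed as a Wald
    interval on the logit scale and transformed back to the probability
    scale. se_logit is the standard error of logit(pi_hat)."""
    def expit(u):
        return 1.0 / (1.0 + exp(-u))
    centre = log(pi_hat / (1.0 - pi_hat))
    return expit(centre - z * se_logit), expit(centre + z * se_logit)
```

Unlike an interval computed directly on the probability scale, the transformed interval always stays within (0, 1), which matters for small probabilities such as those estimated here.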

In Table 5 we turn to regression modelling of the item count question. The first analyses have given us some reason to approach this with caution, as the apparent association between the sensitive and control items will reduce the available information, and as there is some indication of non-compliance by the respondents. However, it is still of interest to see what the item count question can reveal about associations between explanatory variables and self-reported criminal behaviour.

Table 5. Regression models for the item count question in the Euro-Justis survey†
                                         Model 1    Model 2       Model 3       Model 4
Model for sensitive item Y (buying stolen goods)
Constant                                 −2.97      −3.11         −3.82         −4.72
Morality × need                                                                 −2.58 (2.16)
Model for the total Z of the control items
Y (models 2–4 only)                                 0.99 (0.40)   1.11 (0.42)   0.99 (0.38)
†The table shows estimated coefficients and (in parentheses) their standard errors for binary logistic models for having bought stolen goods in the past 12 months, and for ordinal logistic models for the total count of the control items. The constant terms for every model for Z are approximately (−3.1, −1.2, 0.1, 1.2, 2.4).

We consider two substantively interesting explanatory variables: the respondents’ judgement of the moral acceptability of buying stolen goods and their self-reported economic circumstances. The motivation and definitions of these variables were given in Section 'Using survey data to model criminal behaviour'. The respondent's age in years is also included as a control variable. For the control items we use an ordinal logistic model. This provides flexibility for the choice of the distribution by treating it as multinomial but is more parsimonious than a multinomial logistic model in how the effects of the explanatory variables are specified.

Estimates for four models are shown in Table 5. Models 1 and 2 include all three explanatory variables in the models for both Y and Z. They differ in that in model 1 the outcomes Y and Z are conditionally independent given the predictors, whereas in model 2 Z depends also on the main effect of Y as an additional explanatory variable (the model with interactions between Y and the other predictors was not significantly different from this). The difference between these models is statistically significant, so the previous conclusion that Y is significantly associated with Z holds even after controlling for the three explanatory variables. A comparison of the estimated coefficients and their standard errors between these two models shows clearly that conclusions about the model of interest may be strongly affected by whether or not an association between Z and Y is included, and that uncertainties are substantially increased if it needs to be included.

In model 3 we remove two explanatory variables from model 2: age from the model for Y and moral judgement from the model for Z. Neither is significant in model 2, and for the morality variable there is also no substantive motivation for considering it as a predictor for the control items. In this model, the associations involving Z are strongly significant and substantively sensible. They indicate that older people and people who are struggling on their present income tend to have engaged in fewer of the activities on the control list. The effect of Y is that people who have bought stolen goods tend to report higher totals for the control items. As discussed in Section 'Considerations on the design of item count questions', a possible substantive explanation of this involves the control item on having been a victim of crime.

The model of interest in model 3 includes moral judgement and financial need as explanatory variables. These are taken to represent aspects of normative and instrumental motivations of criminal behaviour respectively. Their estimated effects are in expected directions: respondents who have a higher need are more likely to have bought stolen goods, as are people who do not regard such action as morally wrong. The coefficient of moral judgement is not significant, but that of financial need is. For at least one explanatory variable the item count has thus provided enough information for us to be able to detect a positive association between it and the sensitive item, separate from its (negative) association with the control items.

Finally, in model 4 we examine the last substantive hypothesis that was discussed in Section 'Using survey data to model criminal behaviour': that of an interaction between morality and need. Its point estimate is negative. This would be an intriguing conclusion in that it would suggest that moral judgement makes a difference only when need is low, and need makes a difference only when an act is judged immoral in general—which would be the exact opposite of the hypothesis that was proposed by Kroneberg et al. (2010). However, the interaction is clearly not significant so no firm conclusions should be drawn. It is apparent that estimating such an interaction is beyond what the information from these item count data can reliably support.

6. Conclusions

The item count method is a valuable and increasingly commonly used addition to the methodology of asking sensitive survey questions. It has some definite advantages over both direct questioning and other randomized response methods. Statistical analysis of item count data can be implemented elegantly and efficiently with methods for categorical data analysis for incomplete data. We illustrated this in our application, where substantively plausible models for illegal behaviour were obtained.

The method also has its disadvantages and peculiar methodological challenges. Most of these stem from the distinctive feature of an item count question, which is the list of the control items. Even though these items are of no direct substantive interest themselves, careful attention must be paid to them so as not to compromise information about the sensitive item of interest. We have argued that at the analysis stage sufficiently flexible model specification for the total of the control items is crucial, in particular that it should usually be treated as multinomially distributed.

Most of the effort and ingenuity in the design of an item count question should also be devoted to the control items. For validity, responses to them should not be affected by the presence of the sensitive item on the list, nor should they give respondents reasons to lie about it; for efficiency, the control items should ideally be independent of the sensitive item. At the design stage it is not easy to be confident that these conditions will be satisfied. At the analysis stage, failures of them cannot always be detected and, even when they can, are typically not correctable.

All of this makes a substantial challenge for designers of surveys on sensitive topics, and a challenge which will no doubt generate much future research. One practical recommendation that it suggests is that we should aim to build up a body of knowledge about specific item count questions, so that control items which have been found to work well in the past could be used again, even in item count surveys of different sensitive topics.


The Euro-Justis project (full title ‘Scientific indicators of confidence in justice: tools for policy assessment’) was funded by the seventh framework programme of the European Commission under grant agreement 217311. We are grateful to Daniel Oberski for introducing the item count method to us, and to Ben Lauderdale for helpful comments. We thank the Joint Editor, the Guest Associate Editor and a referee for their comments and suggestions.

Appendix A: Details of models for the control items

The model expressions in Section 'Modelling item count data' involve probabilities of the form inline image for observations i, for different values of inline image and inline image. We obtain them by specifying a model for inline image, for all j=0,1, …,J. Defining inline image for j=0, …,J such that inline image if inline image and inline image, we have inline image. We denote by inline image the vector of functions of inline image and inline image, possibly including interactions between them, which enters the linear predictor of the model. Let inline image and inline image denote the matrices whose rows are inline image for the first and last n observations of the pseudodata respectively, where inline image and inline image respectively, and let inline image.

We also need to define the quantities that depend on the model for inline image and that appear in the estimation procedure that was described in Section 'Estimation'. These include the matrix in equation (11) and inline image, inline image, inline image and inline image in equations (10) and (12). For the latter, we give the formulae for inline image, where inline image and inline image for the first n rows of the pseudodata, inline image are the rows of inline image and all the probabilities are calculated at parameter value ψ. The forms of the other three matrices are similar, except that for inline image and inline image the data are inline image, inline image and inline image for the last n rows of the pseudodata, and that for inline image and inline image the parameters are set at inline image.

Below we define these quantities for four different models.

A.1. Binomial logistic model

Here Z follows a binomial distribution with index J and probability p_i, where logit(p_i) is the linear predictor formed from x_Z and ψ. Then

display math

Let inline image be inline image for the observations i=1, …,m of the pseudodata, evaluated at ψ. Then inline image for i=1, …,n, and equation (11) is inline image for i=1, …,m.
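The binomial probabilities of this model can be transcribed directly into Python (an illustrative sketch; the function name and the scalar linear predictor eta are ours):

```python
from math import comb, exp

def binom_logit_probs(eta, J):
    """P(Z = j) for j = 0,...,J under the binomial logistic model:
    Z ~ Binomial(J, p) with logit(p) = eta, the linear predictor."""
    p = 1.0 / (1.0 + exp(-eta))
    return [comb(J, j) * p ** j * (1 - p) ** (J - j) for j in range(J + 1)]
```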

A.2. Beta–binomial logistic model

We specify the beta–binomial model as in Prentice (1986). Let p_i be defined as for the binomial logistic model, and let ρ denote the intraclass correlation. For simplicity we consider here only models where ρ is a constant, but this can easily be generalized. Then inline image and

display math

where inline image. Here any product with an upper limit of −1 is taken to be 1, and any such sum is taken to be 0. Let inline image be inline image for the observations i=1, …,m of the pseudodata, evaluated at ψ. Then inline image where

display math


display math

for i=1, …,n. The elements of inline image are

display math
display math


display math

for i=1, …,m, where

display math


display math

A.3. Multinomial logistic model

Here Z follows a multinomial distribution with index 1 and probabilities

λ_ij = exp(η_ij) / {1 + Σ_{k=1}^{J} exp(η_ik)}

for j=1, …,J, where η_ij is the linear predictor formed from x_Z and ψ_j, and λ_i0 = 1 / {1 + Σ_{k=1}^{J} exp(η_ik)}, so that the probabilities sum to 1. Let inline image be inline image for the observations i=1, …,m of the pseudodata, evaluated at ψ, and define, for i=1, …,m, inline image if inline image and inline image otherwise. Then

display math

for i=1, …,n, and equation (11) is

display math

for i=1, …,m.
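These baseline-category probabilities can be sketched in Python as follows (names ours; etas[j-1] stands for the linear predictor of category j, with category 0 as the baseline):

```python
from math import exp

def multinom_logit_probs(etas):
    """P(Z = j) for j = 0,...,J under the multinomial logistic model
    with linear predictors etas for categories 1..J and category 0 as
    the baseline (linear predictor fixed at 0)."""
    w = [1.0] + [exp(e) for e in etas]  # unnormalized weights, baseline first
    total = sum(w)
    return [v / total for v in w]
```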

A.4. Ordinal logistic model

Redefine inline image by omitting its first element which corresponds to the constant term of the model, and let inline image for j=0, …,J−1, where inline image is a J-vector of 0s, except for a 1 as its (j+1)th element, and let inline image and inline image be vectors of 0s. Let inline image. Here inline image follows a multinomial distribution with index 1 and probabilities inline image for j=0, …,J, where inline image, inline image and inline image for j=0, …,J−1. Let inline image and inline image be matrices whose rows i=1, …,m are inline image and inline image respectively when inline image, and let inline image and inline image be the first n rows of these matrices respectively. Let inline image and inline image when inline image. Then inline image and equation (11) is given by inline image where

display math


display math

for i=1, …,m.
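The cumulative probabilities of the ordinal logistic model can be sketched as follows (our own illustration, using one common proportional-odds parameterization; sign conventions for the cut points vary, and all names are ours). The cut points alphas play the role of the constant terms reported in the footnote to Table 5:

```python
from math import exp

def ordinal_logit_probs(alphas, eta):
    """P(Z = j) for j = 0,...,J under a cumulative (proportional odds)
    logistic model: logit P(Z <= j) = alphas[j] - eta for j = 0,...,J-1,
    with increasing cut points alphas and linear predictor eta."""
    def expit(u):
        return 1.0 / (1.0 + exp(-u))
    cum = [expit(a - eta) for a in alphas] + [1.0]  # P(Z <= j), ending at 1
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]
```

The single coefficient vector for the explanatory variables, shared across all cut points, is what makes this model more parsimonious than the multinomial logistic model with the same number of categories.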