The item count method for sensitive survey questions: modelling criminal behaviour
The item count method is a way of asking sensitive survey questions which protects the anonymity of the respondents by randomization before the interview. It can be used to estimate the probability of sensitive behaviour and to model how it depends on explanatory variables. We analyse item count survey data on the illegal behaviour of buying stolen goods. The analysis of an item count question is best formulated as an instance of modelling incomplete categorical data. We propose an efficient implementation of the estimation which also provides explicit variance estimates for the parameters. We then suggest specifications for the model for the control items, which is an auxiliary but unavoidable part of the analysis of item count data. These considerations and the results of our analysis of criminal behaviour highlight the fact that careful design of the questions is crucial for the success of the item count method.
1. Introduction
Asking sensitive questions about behaviour and attitudes is one of the most difficult challenges in survey measurement. In this paper we consider a question on illegal behaviour, but other sensitive areas include sexual activity, use of illicit drugs and embarrassing or socially undesirable opinions and prejudices. It is easily conceivable that many respondents may not give truthful answers to direct questions on such topics.
Measurement error in answers to sensitive questions may be reduced by some choices in the survey design, such as open-ended questions, asking about behaviour over long reference periods, tolerantly loaded introductions and self-administration of the sensitive questions (see Tourangeau and Yan (2007) and Groves et al. (2009) for overviews). Another common approach is the randomized response method, in which respondents employ a randomizing device to add probabilistic misclassification to their responses and thus conceal their true answers from the interviewer. The original randomized response method was proposed by Warner (1965), and other variants have been developed since (see Chaudhuri and Mukerjee (1988), Lensvelt-Mulders et al. (2005) and Tourangeau and Yan (2007) for overviews).
Another way of protecting the respondents' anonymity is the item count method or list experiment (Miller (1984); Raghavarao and Federer (1979) proposed a closely related approach), which has become increasingly popular recently (see Blair and Imai (2012) for a list of some applications). Its basic idea can be introduced with the question shown in Table 1, which will be considered in our application. Each respondent is presented with some or all of a list of questions with possible answers of yes or no. One of these is the sensitive item which is the focus of interest; in our case this is item 6, which asks whether the respondent has bought stolen goods in the past 12 months. All the other questions are control items which are not of direct interest and not meant to be sensitive. The survey respondents are randomly assigned to either the control group, whose list includes only the control items, or the treatment group, who receive both the control items and the sensitive item. In both groups a respondent is asked to report only their total number of yes answers but not the replies to the individual items. Table 2 shows the observed frequencies of these total counts in a sample in our application.
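To make the design concrete, the following minimal sketch (in Python, with entirely hypothetical prevalence and item probabilities, not estimates from any survey) simulates what each group would report; only the total count and the group indicator are observed:

```python
import numpy as np

rng = np.random.default_rng(42)
n, J = 2000, 5            # respondents and number of control items
pi_true = 0.10            # hypothetical prevalence of the sensitive behaviour
p_items = np.array([0.30, 0.40, 0.05, 0.25, 0.20])  # hypothetical control item probabilities

t = rng.integers(0, 2, size=n)                  # 1 = treatment group, 0 = control group
Z = (rng.random((n, J)) < p_items).sum(axis=1)  # latent total of the J control items
Y = (rng.random(n) < pi_true).astype(int)       # latent answer to the sensitive item
s = Z + t * Y    # reported total: the sensitive answer is hidden inside the sum
```

Note that a treatment group respondent reporting 0 or J + 1 would reveal all of their answers exactly, including the sensitive one; this is the ceiling and floor problem taken up in Section 'Considerations on the design of item count questions'.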
Table 1. The item count question on buying stolen goods, as included in the Euro-Justis survey
|‘I am now going to read you a list of five [six] things that people may do or that may happen to them. Please|
|listen to them and then tell me how many of them you have done or have happened to you in the last 12 months.|
|Do not tell me which ones are and are not true for you. Just tell me how many you have done at least once.'|
|[Items included in both the control and treatment groups]|
|1. Attended a religious service, except for a special occasion like a wedding or funeral.|
|2. Went to a sporting event.|
|3. Attended an opera.|
|4. Visited a country outside [your country]?|
|5. Had personal belongings such as money or a mobile phone stolen from you or from your house.|
|[Item included in the treatment group only]|
|6. Bought something you thought might have been stolen.|
The intention of the item count method is that respondents in the treatment group should feel able to include a truthful answer to the sensitive item in their response because they would realize that it will be hidden from the interviewer when only the total count is reported. Compared with the classical randomized response method, this has the advantage of avoiding the potentially distracting act of randomization by the respondents themselves during the interview. Potential disadvantages of the item count method are that only the treatment group provides information about the question of interest, and that the inclusion of the control items complicates the survey design and adds uncertainty to the estimation.
Table 2. Numbers of respondents with different reported totals for the item count question in the Euro-Justis survey
Quantities of interest can be estimated from randomized response data because the randomization mechanism is known. We may be interested in both estimates of the unconditional probability of the sensitive behaviour and regression-type questions about its associations with explanatory variables. For the item count method, a moment-based (mean difference) estimator has most often been used for the unconditional probability, and straightforward extensions of it for regression modelling (Chaudhuri and Christofides, 2007; Tsuchiya et al., 2007; Holbrook and Krosnick, 2010; Glynn, 2013; Coutts and Jann, 2011). These approaches, however, may be inefficient, and the regression methods in particular are not ideally suited for modelling a count response.
The key to more efficient and coherent analysis of any randomized response items is to treat it as a problem of incomplete categorical data. This idea has been applied to the modelling of classical randomized response designs by, for example, Maddala (1983), Scheers and Dayton (1988), Chen (1989) and van den Hout and van der Heijden (2004). For item count data, it was first fully recognized by Imai (2011), whose work represents a major advance in the methodology of the item count technique (see also Blair and Imai (2012)). Corstange (2009) has also employed models for categorical data, for a related design in which each item is asked individually in the control group.
In this paper we consider the modelling of item counts as categorical data. We expand on the results of Imai (2011) in two main ways. First, we propose a faster implementation of the estimation and an explicit estimate of the variance matrix of the estimated parameters. Second, we propose a more flexible set of choices for the model for the control items, which we argue can have a major effect on conclusions about the model of interest for the sensitive item. These points are developed in Section 'Modelling item count data'. An application on modelling criminal behaviour is introduced in Section 'Using survey data to model criminal behaviour' and analysed in Section 'Item count estimates of criminal behaviour'. In Section 'Considerations on the design of item count questions' we discuss the design of item count questions, and in Section 'Conclusions' we offer some conclusions.
2. Using survey data to model criminal behaviour
Our substantive application concerns criminal acts, specifically buying stolen goods. A prominent problem in criminology is understanding what shapes deviant and illegal behaviour. Understanding why people do or do not commit crimes sheds light on what motivates them and on how institutions can influence behaviour. To explain variation in such activity, we need a relatively precise estimate of it. Yet criminal behaviour is a classic sensitive issue, which people tend to conceal in survey situations in order to create a good impression.
In the criminological literature, the different motivations can be grouped into instrumental and normative. Instrumental motivations are guided by rational choice where people make implicit or explicit calculations of the risks of action against the benefits; consonant modes of crime control policy focus on deterrence and punishment. Yet, thus far the evidence is mixed. It suggests that most people are not driven by calculations of the risks and benefits of committing crime. Accordingly, criminologists have started to move to trying to understand when and for which crimes deterrence might be an influence.
Normative motivations relate to the idea that reasons of morality explain why most people, most of the time, do not commit crimes. Influenced by socialization, family, friends and so forth, people do not act in ways that they believe to be wrong. Social pressure and disapproval plausibly reinforce one's motivation to act according to guiding moral principles. The important thing then is that people think that it is wrong to buy stolen goods, for example, and think that it is morally correct to obey the law simply because it is the law (Tyler 2006a, b; Jackson et al., 2012). Criminological research on these issues continues apace, no doubt with many important insights still to come.
An especially interesting area of on-going research draws instrumental and normative motivations into a single model of motivation. For example, Kroneberg et al. (2010) reasoned that only a certain proportion of individuals make cost–benefit calculations when considering whether to commit a particular act. People who believe that the act is morally wrong do not believe that it is an option (see also Wikström et al. (2012)). From an analysis of a survey of Germans, on the specific crime of shoplifting, their findings were consistent with the idea that ‘ … respondents with strongly internalized norms disregard instrumental incentives to shoplift’. In other words, if people believe that it is morally wrong to shoplift, then risk–benefit calculations will not predict intentions to commit the crime, but if they think that it is not morally wrong to shoplift then these calculations start to come into play.
We examine similar questions by using data from a survey that was carried out in Italy, Bulgaria and Lithuania in October–November 2010. The surveys were conducted separately by different organizations in the three countries but co-ordinated by the principal investigators in the UK. The main purpose of the survey was to measure the legitimacy of the criminal justice systems in these three European countries, as part of a broader project into trust in justice known as Euro-Justis (Hough and Sato, 2011). One of the key outcomes of the Euro-Justis project was the inclusion of a module of questions in round 5 of the European Social Survey in 2010 (European Social Survey, 2011); however, this did not include the item count question that is considered here. We use the pooled sample of 2549 respondents (1007 for Bulgaria, 521 for Italy and 1021 for Lithuania). The surveys are not treated as probability samples from the national populations. The main aim of our analysis is regression modelling of illegal behaviour rather than estimation of population proportions.
We consider questions that were motivated by Kroneberg et al. (2010), using a subset of their concepts. The crime that we consider is buying stolen goods. An item count question on it was included in the Euro-Justis survey, worded as shown in Table 1 (and translated into the national languages). The frequencies of responses to it in the pooled sample are shown in Table 2.
We consider two explanatory variables of primary interest. The first is the assessment of the morality of the crime, measured by a survey question whose core part was worded as ‘ … please tell me how wrong it is to … buy something you thought might be stolen’. The response options were ‘Not wrong at all’ (coded in our analysis as 1), ‘A bit wrong’ (coded as 2/3), ‘Wrong’ (1/3) and ‘Seriously wrong’ (0). The second variable that we focus on is personal financial circumstances, measured by the question ‘Which of [these descriptions] comes closest to how you feel about your household's income nowadays?’, with responses ‘Living comfortably …’ (0), ‘Coping …’ (1/3), ‘Finding it difficult …’ (2/3) and ‘Finding it very difficult on present income’ (1). We treat this as a measure of financial need but also as a rough proxy for perceived benefits of financial crime such as buying stolen goods—although acknowledging the obvious limitations of the latter treatment. ‘Don't know’ responses are coded as missing. In the survey, the question on morality was asked much earlier than the item count question, which came just before the question on financial need.
Our substantive hypotheses are that lower financial need and a view that buying stolen goods is morally wrong will be associated with a lower probability of having committed the crime. A further statistical hypothesis is that there would be a positive interaction between the two explanatory variables. This corresponds to the substantive hypothesis, in the spirit of Kroneberg et al. (2010), that strongly internalized norms lead people to disregard instrumental incentives to commit the crime.
The methodological questions that we consider next are how the item count data should best be analysed to answer the substantive questions, and what aspects of the quality of the data and the survey design affect the chances of success of this analysis.
3. Modelling item count data
3.1. Estimation

Consider an item count survey question which includes $J$ control items and one sensitive item. Suppose that we have data on $n$ respondents, of whom $n_{0}$ have been randomly assigned into the control group, where the question included only the control items, and $n_{1}=n-n_{0}$ to the treatment group, where the sensitive item was also included. Let $t_{i}$, $i=1,\dots,n$, be a treatment indicator such that $t_{i}=1$ for respondents $i$ in the treatment group, and $t_{i}=0$ in the control group.
Let $S_{i}$ denote a respondent's answer to the item count question, with possible values $0,1,\dots,J+1$, and $s_{i}$ its observed value in the sample. Here $Z_{i}$ is the total for the control items, and $Y_{i}=1$ if the answer to the sensitive item is yes and $Y_{i}=0$ if it is no. For the control group $S_{i}=Z_{i}$, but for the treatment group $S_{i}=Z_{i}+Y_{i}$ and $Z_{i}$ and $Y_{i}$ are not observed separately (the value of $Y_{i}$ for the control group is hypothetical and not used). Define $\mathbf{t}=(t_{1},\dots,t_{n})'$ and vectors $\mathbf{s}$ and $\mathbf{Y}$ similarly, and let $X=(x_{1},\dots,x_{n})'$ where $x_{i}$ is a vector of explanatory variables for respondent $i$, including a constant 1.
We assume that the $S_{i}$ for different respondents are independent given $X$. The model for $S_{i}$ is specified through two models: $p(Y_{i}\,|\,x_{Yi};\beta)$ for the sensitive item and $p(Z_{i}\,|\,x_{Zi},Y_{i};\psi)$ for the total of the control items, for $Y_{i}=0,1$ and $Z_{i}=0,\dots,J$. Here $x_{Yi}$ and $x_{Zi}$ are two subsets of the variables in $x_{i}$, which need not be identical, and β and ψ are distinct parameter vectors. The substantive interest is on the model for $Y_{i}$, for which we use the binary logistic model

$$ \log\left\{\frac{p(Y_{i}=1\,|\,x_{Yi};\beta)}{1-p(Y_{i}=1\,|\,x_{Yi};\beta)}\right\} = x_{Yi}'\beta. \qquad (1) $$

When $x_{Yi}=1$, the only unknown parameter in this is $\beta=\mathrm{logit}(\pi)$, where $\pi=p(Y_{i}=1)$ is the unconditional probability of positive response to the sensitive item. Different specifications for $p(Z_{i}\,|\,x_{Zi},Y_{i};\psi)$ are discussed in Section 'Specification of a model for the control items'. Let $\theta=(\beta',\psi')'$.
We make the following assumptions: (a) the respondents are assigned to the treatment and control groups randomly, so that $t_{i}$ is independent of $(Y_{i},Z_{i})$ given $x_{i}$; (b) there is no design effect, i.e. the answers to the control items are not affected by the presence of the sensitive item on the list; and (c) the respondents in the treatment group give truthful answers to the sensitive item.
Under these assumptions, the model for the reported total $s_{i}$ in the observed data is

$$ p(s_{i}\,|\,x_{i};\theta) = \sum_{y=0}^{1} p(Y_{i}=y\,|\,x_{Yi};\beta)\, p(Z_{i}=s_{i}-t_{i}y\,|\,x_{Zi},Y_{i}=y;\psi), \qquad (2) $$

where we take $p(Z_{i}=J+1\,|\,\cdot)=p(Z_{i}=-1\,|\,\cdot)=0$; these correspond to the two impossible values $(0,J+1)$ and $(1,0)$ of $(Y_{i},s_{i})$ in the treatment group.
When there are no explanatory variables, a simple moment-based estimator of $\pi$ is the mean difference

$$ \hat{\pi} = \bar{s}_{1} - \bar{s}_{0}, \qquad (3) $$

where $\bar{s}_{1}$ and $\bar{s}_{0}$ are the sample means of $s_{i}$ in the treatment and control groups respectively. It follows from equation (2) that $\hat{\pi}$ is an unbiased estimator of $\pi$. It is also the maximum likelihood estimator under certain conditions, as discussed in Section 'Specification of a model for the control items'. However, it lacks the flexibility to accommodate less than saturated models for $Y_{i}$ and $Z_{i}$. It also does not provide a generally convenient extension to models with explanatory variables. The most obvious generalization of equation (3) in this direction is the linear model $E(s_{i}\,|\,x_{i})=x_{i}'\gamma+t_{i}x_{i}'\beta$ (Holbrook and Krosnick, 2010). This yields a consistent estimator of β for a linear probability model for $Y_{i}$ under certain assumptions about the model for $Z_{i}$. However, a linear model for $Y_{i}$ is generally unappealing, and the approach again fails to provide flexibility for modelling $Z_{i}$.
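As a sketch, the mean difference estimator (3) can be applied to simulated data; the prevalence, item probabilities and sample size below are hypothetical choices, not values from the survey:

```python
import numpy as np

rng = np.random.default_rng(1)
n, J, pi_true = 20000, 5, 0.10                      # hypothetical values
p_items = np.array([0.30, 0.40, 0.05, 0.25, 0.20])  # hypothetical control items

t = rng.integers(0, 2, size=n)                      # random group assignment
Z = (rng.random((n, J)) < p_items).sum(axis=1)      # latent control totals
Y = (rng.random(n) < pi_true).astype(int)           # latent sensitive answers
s = Z + t * Y                                       # only s and t are observed

pi_hat = s[t == 1].mean() - s[t == 0].mean()        # equation (3): mean difference
se_hat = np.sqrt(s[t == 1].var(ddof=1) / (t == 1).sum()
                 + s[t == 0].var(ddof=1) / (t == 0).sum())
```

The standard error illustrates the efficiency cost of the design: the variance of the control totals enters in full, even though the control items are of no substantive interest.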
A more satisfactory framework for the analysis of item count data is to treat it as the categorical data problem that it clearly is. For a convenient description, we introduce first the device of a set of pseudodata which consists of two stacked copies of the observed data set. The pseudodata thus have $m=2n$ observations where, for each $i=1,\dots,n$, observations $i$ and $i+n$ have the same values of the observed variables. We then add the pseudovariables $\tilde{y}_{i}$ and $\tilde{z}_{i}$, with the values $(\tilde{y}_{i},\tilde{z}_{i})=(1,\,s_{i}-t_{i})$ for $i=1,\dots,n$ and $(\tilde{y}_{i},\tilde{z}_{i})=(0,\,s_{i-n})$ for $i=n+1,\dots,m$. We denote the explanatory variables and treatment indicators in the pseudodata by $\tilde{x}_{i}$ and $\tilde{t}_{i}$ for $i=1,\dots,m$, where by definition $(\tilde{x}_{i},\tilde{t}_{i})$ is $(x_{i},t_{i})$ for $i=1,\dots,n$ and $(x_{i-n},t_{i-n})$ for $i=n+1,\dots,m$.
The observed data log-likelihood is

$$ l(\theta;\mathbf{s}) = \sum_{i=1}^{n} \log\, p(s_{i}\,|\,x_{i};\theta), \qquad (4) $$

or, in terms of the pseudodata,

$$ l(\theta;\mathbf{s}) = \sum_{i=1}^{n} \log\{P_{i}(\theta)+P_{i+n}(\theta)\}, $$

where $P_{i}(\theta)=p(Y_{i}=\tilde{y}_{i}\,|\,\tilde{x}_{Yi};\beta)\,p(Z_{i}=\tilde{z}_{i}\,|\,\tilde{x}_{Zi},\tilde{y}_{i};\psi)$ for $i=1,\dots,m$, with probabilities of impossible values of $\tilde{z}_{i}$ set to 0 as in expression (2).
One convenient way to maximize it is with the EM algorithm (Dempster et al., 1977). This is based on viewing the values of $Y_{i}$ as missing data. If they and thus also all $Z_{i}$ were observed, the complete-data log-likelihood would be

$$ l_{C}(\theta) = \sum_{i=1}^{n} \{\log\, p(Y_{i}\,|\,x_{Yi};\beta) + \log\, p(Z_{i}\,|\,x_{Zi},Y_{i};\psi)\}. \qquad (5) $$

To express this in terms of the pseudodata, let there be indicators $u_{i}$ equal to 1 when $\tilde{y}_{i}$ is equal to the true $Y_{i}$, for each $i=1,\dots,m$. Then

$$ l_{C}(\theta) = \sum_{i=1}^{m} u_{i}\, \log P_{i}(\theta), \qquad (6) $$

where $u_{i}=1$ if $\tilde{y}_{i}=Y_{i}$ and $u_{i}=0$ otherwise. This means that, if $Y_{i}$ were observed, we would know which of the two paired rows of the pseudodata corresponded to the true value of $Y_{i}$. As we do not know this, the EM algorithm gives weight to both possibilities. The algorithm proceeds by alternating between two steps: the E-step computes the expected value $Q(\theta;\theta^{(k)})$ of expression (6) given the observed data and the current parameter values $\theta^{(k)}$, which replaces $u_{i}$ by the weights

$$ w_{i} = \frac{P_{i}(\theta^{(k)})}{P_{i}(\theta^{(k)})+P_{\bar{i}}(\theta^{(k)})}, \qquad (7) $$

where $\bar{i}=i+n$ for $i=1,\dots,n$ and $\bar{i}=i-n$ for $i=n+1,\dots,m$; the M-step then maximizes $Q(\theta;\theta^{(k)})=\sum_{i=1}^{m} w_{i}\log P_{i}(\theta)$ with respect to θ to obtain $\theta^{(k+1)}$.
A maximum likelihood estimate of θ is obtained by iterating the algorithm to convergence, starting from some initial values $\theta^{(0)}$. This is the estimation method that was proposed by Imai (2011) and implemented in the R software (R Core Team, 2012) in the package list (Blair and Imai, 2010).
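For intuition, the following is a simplified sketch of the EM iteration for an intercept-only model, assuming $Z_i$ independent of $Y_i$ with an unrestricted multinomial distribution; the general algorithm of Imai (2011) additionally handles covariates and other control-item models. All true values in the usage example are hypothetical.

```python
import numpy as np

def em_item_count(s, t, J, n_iter=500):
    """EM for an intercept-only item count model: theta = (pi, q),
    with q_j = P(Z = j) multinomial and Z assumed independent of Y."""
    pi, q = 0.5, np.full(J + 1, 1.0 / (J + 1))
    for _ in range(n_iter):
        # P(Z = j) lookup that returns 0 outside the support 0..J
        qx = lambda j: np.where((j >= 0) & (j <= J), q[np.clip(j, 0, J)], 0.0)
        # E-step: w = posterior probability that the sensitive answer is yes
        num = pi * qx(s - t)                 # pseudodata row with Y = 1, Z = s - t
        w = num / (num + (1 - pi) * qx(s))   # pseudodata row with Y = 0, Z = s
        # M-step: closed-form updates from the weighted pseudodata
        pi = w.mean()
        counts = np.zeros(J + 1)
        np.add.at(counts, np.clip(s - t, 0, J), w)   # Y = 1 rows: Z = s - t
        np.add.at(counts, np.clip(s, 0, J), 1 - w)   # Y = 0 rows: Z = s
        q = counts / counts.sum()
    return pi, q

# Usage with simulated data (hypothetical true values)
rng = np.random.default_rng(7)
n, J, pi_true = 40000, 5, 0.20
q_true = np.array([0.10, 0.20, 0.30, 0.20, 0.10, 0.10])
t = rng.integers(0, 2, size=n)
s = rng.choice(J + 1, size=n, p=q_true) + t * (rng.random(n) < pi_true)
pi_hat, q_hat = em_item_count(s, t, J)
```

Each E-step weights the two stacked pseudodata rows for a respondent; the M-step is then a weighted mean (for π) and weighted relative frequencies (for the multinomial probabilities). Note that for control group rows the two pseudodata rows are identical, so their weight is simply the current value of π.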
Two well-known disadvantages of the EM algorithm are that it can be slow to converge and that it does not automatically provide an estimate of the variance matrix of $\hat{\theta}$. Here the difference in speed from our Newton–Raphson algorithm described below is enhanced by the fact that the EM implementation requires two iterative procedures at each M-step, whereas the Newton–Raphson estimation involves only a single non-iterative step per iteration. In our application the difference is mostly in convenience rather than real practical significance: a typical model might take less than 1 s and 15 iterations with Newton–Raphson iteration and 12 s and 150 (E-step) iterations with EM. For the variance matrix, Imai (2011) used numerical differentiation of $l(\theta;\mathbf{s})$, as implemented in the optim function in R, to approximate the observed data information matrix. We propose to replace this with a closed form expression.
We make use of an elegant but relatively little-used result which is implicit in some earlier literature on the EM algorithm but which was first stated explicitly by Oakes (1999). This shows that the function $Q(\theta;\tilde{\theta})$ derived at the E-step of the EM algorithm can also be used to calculate both the observed data score function, as

$$ \frac{\partial l(\theta;\mathbf{s})}{\partial\theta} = \left[\frac{\partial Q(\theta;\tilde{\theta})}{\partial\theta}\right]_{\tilde{\theta}=\theta}, \qquad (8) $$

and the observed data information matrix, as

$$ I(\theta) = -\frac{\partial^{2} l(\theta;\mathbf{s})}{\partial\theta\,\partial\theta'} = -\left[\frac{\partial^{2} Q(\theta;\tilde{\theta})}{\partial\theta\,\partial\theta'} + \frac{\partial^{2} Q(\theta;\tilde{\theta})}{\partial\theta\,\partial\tilde{\theta}'}\right]_{\tilde{\theta}=\theta}, \qquad (9) $$

both of which hold for any value of θ. In these expressions it is crucial that θ and $\tilde{\theta}$ are treated as distinct quantities, with the substitution $\tilde{\theta}=\theta$ made only after differentiation.
The results (8) and (9) allow us, first, to speed up the convergence of the estimation substantially by replacing the M-step of the EM algorithm with a Newton–Raphson update step

$$ \theta^{(k+1)} = \theta^{(k)} + I(\theta^{(k)})^{-1}\left[\frac{\partial l(\theta;\mathbf{s})}{\partial\theta}\right]_{\theta=\theta^{(k)}}. $$

Second, an estimated variance matrix of the estimates $\hat{\theta}$ is given by equation (9) as $\widehat{\mathrm{var}}(\hat{\theta})=I(\hat{\theta})^{-1}$. Because the Newton–Raphson algorithm may also diverge, to achieve convergence it is important to use sensibly chosen starting values and/or to shorten the update step for some iterations if needed.
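To illustrate identities (8) and (9) in a simple case, consider the intercept-only model with the control-total distribution treated as known, so that θ is just the scalar π. The sketch below (with hypothetical data and probabilities) compares the Oakes expressions with numerical derivatives of the observed data log-likelihood and takes one Newton–Raphson step:

```python
import numpy as np

q = np.array([0.2, 0.3, 0.3, 0.2])      # assumed known P(Z = j), J = 3
s = np.array([1, 2, 0, 4, 2, 3, 1])     # hypothetical reported totals
t = np.array([0, 0, 1, 1, 1, 1, 1])     # group indicators
J = len(q) - 1

def qx(j):
    """P(Z = j), zero outside the support 0..J."""
    return np.where((j >= 0) & (j <= J), q[np.clip(j, 0, J)], 0.0)

def loglik(pi):
    # observed data log-likelihood, equation (2) summed over respondents
    return np.log(pi * qx(s - t) + (1 - pi) * qx(s)).sum()

def score_info(pi):
    """Score and observed information from the Q-function (Oakes, 1999)."""
    w = pi * qx(s - t) / (pi * qx(s - t) + (1 - pi) * qx(s))  # E-step weights
    score = (w / pi - (1 - w) / (1 - pi)).sum()               # identity (8)
    info = ((w / pi**2 + (1 - w) / (1 - pi)**2).sum()         # identity (9)
            - (w * (1 - w) * (1 / pi + 1 / (1 - pi))**2).sum())
    return score, info

pi = 0.3
score, info = score_info(pi)
pi_next = pi + score / info     # one Newton-Raphson update step
```

The second sum in `info` is the 'missing information' term, driven by the predictive variances $w_i(1-w_i)$ of the unobserved sensitive answers; it vanishes for control group respondents, whose two pseudodata rows coincide.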
For the model for item count data, $Q(\theta;\tilde{\theta})$ is given by equation (6) with $u_{i}$ replaced by the weights from equation (7). Here the probabilities $P_{i}$ depend only on θ, and $w_{i}$, defined by equation (7), only on $\tilde{\theta}$. Denote $\pi_{i}=p(Y_{i}=1\,|\,\tilde{x}_{Yi};\beta)$. Under the logistic model (1), the score function (8) is then

$$ \frac{\partial l(\theta;\mathbf{s})}{\partial\beta} = \sum_{i=1}^{m} w_{i}(\tilde{y}_{i}-\pi_{i})\,\tilde{x}_{Yi}, \qquad \frac{\partial l(\theta;\mathbf{s})}{\partial\psi} = \sum_{i=1}^{m} w_{i}\,\frac{\partial \log\, p(Z_{i}=\tilde{z}_{i}\,|\,\tilde{x}_{Zi},\tilde{y}_{i};\psi)}{\partial\psi}, \qquad (10) $$

with the weights evaluated at $\tilde{\theta}=\theta$. For the first term of equation (9),

$$ -\frac{\partial^{2} Q}{\partial\beta\,\partial\beta'} = \sum_{i=1}^{m} w_{i}\,\pi_{i}(1-\pi_{i})\,\tilde{x}_{Yi}\tilde{x}_{Yi}', \qquad -\frac{\partial^{2} Q}{\partial\psi\,\partial\psi'} = -\sum_{i=1}^{m} w_{i}\,\frac{\partial^{2} \log\, p(Z_{i}=\tilde{z}_{i}\,|\,\tilde{x}_{Zi},\tilde{y}_{i};\psi)}{\partial\psi\,\partial\psi'}, \qquad (11) $$

and the cross-derivative blocks $\partial^{2}Q/\partial\beta\,\partial\psi'$ are zero matrices. The second term of equation (9) is

$$ \frac{\partial^{2} Q}{\partial\theta\,\partial\tilde{\theta}'} = \sum_{i=1}^{n} w_{i}(1-w_{i})\, g_{i}(\theta)\, g_{i}(\tilde{\theta})', \qquad \text{where } g_{i}(\theta) = \frac{\partial \log P_{i}(\theta)}{\partial\theta} - \frac{\partial \log P_{i+n}(\theta)}{\partial\theta}, \qquad (12) $$

which is symmetric when $g_{i}(\theta)$ and $g_{i}(\tilde{\theta})$ are evaluated at $\tilde{\theta}=\theta$.
These expressions apply when the model for $Y_{i}$ is the logistic model (1). The specific forms of equations (10)-(12) depend on the choice of the model for $Z_{i}$. It can be seen that only the first and second derivatives of the logarithms of these probabilities are needed to complete the calculations. Explicit expressions for the four models that we shall consider are given in Appendix A. Finally, the observed data log-likelihood at the maximum likelihood estimates can be calculated from the pseudodata as

$$ l(\hat{\theta};\mathbf{s}) = \sum_{i=1}^{n} \log\{P_{i}(\hat{\theta})+P_{i+n}(\hat{\theta})\}, $$

with the probabilities evaluated at $\hat{\theta}$.
More generally, any randomized response or comparable technique will involve incomplete data of some kind. This means that modelling methods for them can often be conveniently developed along similar lines to those above. References to such work for classical randomized response were given in Section 'Introduction'. For item count questions, this could be done for example for the modified version of Chaudhuri and Christofides (2007) which is designed to avoid the ceiling effect that is discussed in Section 'Considerations on the design of item count questions' (although some external information is then also needed by design). Jann et al. (2012) described such modelling for yet another method for sensitive questions: the crosswise model of Yu et al. (2008).
3.2. Specification of a model for the control items
The formulation of the problem in the previous section makes it clear that any analysis of item count data involves a model for the totals of the control items, whether or not this is explicit in the formulae of the estimators. This model is a distinctive element of the method, which does not arise in classical forms of randomized response. It is a nuisance element which is of no substantive interest in itself. Nevertheless, it still needs to be specified appropriately, lest errors in it distort estimates of the model of interest for $Y_{i}$. In this section we compare possible choices for the model for the control items.
There are two parts to the specification of the model for $Z_{i}$: the choice of the distribution itself, and how it may depend on the explanatory variables and on $Y_{i}$. For the distribution, Imai (2011) and the computer implementation by Blair and Imai (2010) considered the binomial and the beta-binomial distributions. For the explanatory variables they used $x_{Zi}=x_{Yi}$, i.e. the same variables as in the model for $Y_{i}$, either alone or interacted fully with $Y_{i}$. Here we suggest some generalizations for both of these elements of the model. Details of the specific models that we use are given in Appendix A.
Let $Z_{ij}$, $j=1,\dots,J$, denote respondent $i$'s unobserved answer to control item $j$, with values 0 for no and 1 for yes, so that $Z_{i}=\sum_{j} Z_{ij}$. Suppose that each $Z_{ij}$ follows a Bernoulli distribution with probability $p_{ij}$, and that different items may be dependent, with covariances $\sigma_{ijk}=\mathrm{cov}(Z_{ij},Z_{ik})$. Then $Z_{i}$ has mean $J\bar{p}_{i}$ and variance

$$ \mathrm{var}(Z_{i}) = J\bar{p}_{i}(1-\bar{p}_{i}) - \sum_{j=1}^{J}(p_{ij}-\bar{p}_{i})^{2} + 2\sum_{j<k}\sigma_{ijk}, \qquad (13) $$

where $\bar{p}_{i}=\sum_{j} p_{ij}/J$. The mean is equal to that of a binomial distribution with index $J$ and probability $\bar{p}_{i}$. The first term of equation (13) is the variance of this binomial distribution, whereas the last two terms represent overdispersion or underdispersion relative to this variance. The second term, which is due to heterogeneity of the probabilities $p_{ij}$, is always negative and thus contributes underdispersion. The third term can be positive or negative, depending on the pattern of dependences between the $Z_{ij}$.

If there is neither heterogeneity nor dependence, the last two terms of equation (13) are both 0 and the distribution of $Z_{i}$ is binomial. If there is no dependence, the last term is 0 and we obtain the Poissonian binomial distribution in the sense of Johnson et al. (1992), section 3.12.2, which is always underdispersed relative to the binomial distribution. If there is no heterogeneity, the second term is 0. If we then further assume that the covariances are all equal, we obtain $\mathrm{var}(Z_{i})=Jp_{i}(1-p_{i})\{1+(J-1)\rho\}$, where ρ is the common 'intraclass correlation' between the $Z_{ij}$ for respondent $i$. The beta-binomial distribution has a variance of this form. Its standard motivation as a mixture distribution implies that ρ is non-negative, but more generally the distribution is also well defined for some negative values of ρ (Prentice, 1986).
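The decomposition in equation (13) is easy to verify numerically; a sketch with hypothetical item probabilities and a common pairwise covariance:

```python
import numpy as np

p = np.array([0.30, 0.40, 0.05, 0.25, 0.20])   # hypothetical item probabilities
J = len(p)
cov = 0.01                                     # common covariance between item pairs

# Direct variance of the sum of J dependent Bernoulli variables
var_direct = (p * (1 - p)).sum() + J * (J - 1) * cov   # 2 * sum_{j<k} cov

# Decomposition (13): binomial variance minus heterogeneity plus dependence
pbar = p.mean()
var_13 = J * pbar * (1 - pbar) - ((p - pbar) ** 2).sum() + J * (J - 1) * cov
```

The heterogeneity term is never positive, so with independent items the count can only be underdispersed relative to the binomial distribution; dependence between items is needed for overdispersion.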
It is, however, undesirable to consider only such special cases of the distribution of $Z_{i}$. First, even a general version of the beta-binomial model, say, cannot accommodate a distribution which is strongly underdispersed relative to the binomial distribution. Such a distribution for the control items would in fact be ideal from a design point of view (see the discussion in Section 'Considerations on the design of item count questions'), so we should prefer a family which can represent such items if we do manage to create them. Second, in the item count context it may not be enough to model well only the mean and variance of $Z_{i}$. Expression (2) for the probabilities of the observed data shows that these involve all the probabilities of individual values of $Z_{i}$. An adequate model for all of them is thus needed to disentangle them correctly from the model for $Y_{i}$.
We suggest that by default the distribution of $Z_{i}$ should be specified with maximum flexibility, as a multinomial distribution with index 1 and probabilities $p(Z_{i}=z\,|\,x_{Zi},Y_{i};\psi)$ for $z=0,\dots,J$. In the examples below we compare the multinomial with the binomial and beta-binomial models. For the dependence of the multinomial probabilities on explanatory variables we consider two possibilities: the multinomial logistic model, which ignores the ordering of the values of $Z_{i}$, and the ordinal logistic (proportional odds) model, which takes the ordering into account. The ordinal model is in principle appealing because it is flexible in the response distribution but relatively parsimonious in the parameterization of the effects of the explanatory variables.
Consider now choices for the explanatory variables for the model for $Z_{i}$. We denote these by $x_{Zi}$. They include at least a constant term, but possibly also $x_{Yi}$ and even products (interactions) between $Y_{i}$ and some or all of the other variables. Note that $x_{Zi}$ does not need to include the same variables as $x_{Yi}$, and there may be some gain in efficiency if it does not. Throughout, models with nested choices of $x_{Zi}$ may be compared by using likelihood ratio tests.
The most consequential aspect of the model for $Z_{i}$ is the extent to which it depends on $Y_{i}$. In particular, the estimates will be most efficient when it does not, i.e. when $Z_{i}$ and $Y_{i}$ are conditionally independent given the explanatory variables. If this is so, expression (2) shows that $p(s_{i}\,|\,x_{i};\theta)=p(Z_{i}=s_{i}\,|\,x_{Zi};\psi)$ for the control group. The parameters ψ could then even be estimated directly by fitting the model for $Z_{i}$ to the data for the control group only, and the data in the treatment group will contribute information mostly about the model for $Y_{i}$. If the conditional independence does not hold, a smaller amount of information is available on both models, and both are mixed up in both groups.
Further insight into this loss of information is provided by the form of the information matrix $I(\theta)$ in equation (9). Its second term can be seen as the 'missing information' due to the fact that the $Y_{i}$ are not observed. All of its terms depend on the quantities $w_{i}(1-w_{i})$, which are the predictive variances of the unknown $Y_{i}$ given $s_{i}$ and $x_{i}$. For ψ, and through the cross-derivative terms of $I(\theta)$ also for β, the missing information also involves contributions from the model for $Z_{i}$, which are of the form $\partial \log\,p(Z_{i}\,|\,x_{Zi},Y_{i}=1;\psi)/\partial\psi - \partial \log\,p(Z_{i}\,|\,x_{Zi},Y_{i}=0;\psi)/\partial\psi$. These describe how different the gradients of $\log\,p(Z_{i}\,|\,x_{Zi},Y_{i};\psi)$ are at the two possible values 0 and 1 of $Y_{i}$. The magnitude of these differences, and thus the amount of missing information, depends on the specification of the model for $Z_{i}$. The key feature of the case where $Z_{i}$ is conditionally independent of $Y_{i}$ is that this difference is then 0 for all observations in the control group, so they do not add anything to this component of the missing information.
A model of special interest is one which has $x_{Yi}=1$ and $Z_{i}$ multinomially distributed with probabilities depending on $Y_{i}$ but on no other explanatory variables. This uses $2J+1$ parameters to model the $J+1+J=2J+1$ free probabilities in the table of randomization group $t$ by observed item count $s$ (see Table 2). The model is thus saturated, and the maximum likelihood estimators of the probabilities $p(s\,|\,t)$ are the observed sample proportions $\hat{p}(s\,|\,t)$. Solving the expressions of these probabilities in expression (2) for $\pi$ and $p(Z=j\,|\,Y)$, we obtain as estimate of $\pi$ the mean difference $\hat{\pi}$ given by equation (3), and for $p(Z=j\,|\,Y=1)$ the recursion

$$ \hat{p}(Z=j\,|\,Y=1) = \hat{p}(Z=j-1\,|\,Y=1) + \frac{\hat{p}(s=j\,|\,t=0)-\hat{p}(s=j\,|\,t=1)}{\hat{\pi}} $$

for $j=0,\dots,J$, starting with $\hat{p}(Z=-1\,|\,Y=1)=0$, and then $\hat{p}(Z=j\,|\,Y=0)=\{\hat{p}(s=j\,|\,t=0)-\hat{\pi}\,\hat{p}(Z=j\,|\,Y=1)\}/(1-\hat{\pi})$ (see also Glynn (2013) (supplementary material), who gave corresponding expressions for the probabilities of $Y$ given $Z$). These are equal to the maximum likelihood estimates of $\pi$ and $p(Z=j\,|\,Y)$ obtained as in Section 'Estimation', if the estimated probabilities are all non-negative (the case where they are not is discussed in the next section). This equivalence demonstrates that any gain in efficiency that is obtained by the maximum likelihood estimators over the mean difference (3) is not due to the formulation of the problem as a model for categorical data, but is realized only if we can assume a more parsimonious model for $Z_{i}$ than this saturated multinomial model, for example that $Z_{i}$ and $Y_{i}$ are independent.
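The equivalence can be checked at the population level: starting from hypothetical values of $\pi$ and of the distributions of $Z$ given $Y$, the two observable count distributions return these quantities exactly through the mean difference and cumulative differences. A sketch:

```python
import numpy as np

J, pi = 3, 0.15
p1 = np.array([0.10, 0.30, 0.40, 0.20])   # hypothetical P(Z = j | Y = 1)
p0 = np.array([0.30, 0.30, 0.20, 0.20])   # hypothetical P(Z = j | Y = 0)

# Population distributions of the reported total s in the two groups
q0 = pi * p1 + (1 - pi) * p0                                  # control: s = Z
q1 = (1 - pi) * np.append(p0, 0.0) + pi * np.append(0.0, p1)  # treatment: s = Z + Y

# The mean difference recovers pi exactly ...
pi_hat = (np.arange(J + 2) * q1).sum() - (np.arange(J + 1) * q0).sum()
# ... and cumulative differences of the two distributions recover P(Z = j | Y = 1)
p1_hat = (np.cumsum(q0) - np.cumsum(q1[:J + 1])) / pi_hat
```

This works because the difference of the group probabilities at each count $k$ telescopes: $p(s=k\,|\,t=0)-p(s=k\,|\,t=1)=\pi\{p(Z=k\,|\,Y=1)-p(Z=k-1\,|\,Y=1)\}$.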
We conducted a limited simulation study of the effect of the specification of the model for $Z_{i}$. Four situations were considered, with 1000 data sets simulated in each. There were no explanatory variables. In each case the sample size was 2400 with 1200 observations in each of the treatment and control groups, the sensitive item $Y_{i}$ was drawn from a Bernoulli distribution with probability $\pi$ and there were five control items. The four cases differ in the model for $Z_{i}$ and represent different combinations of its distribution and whether this depends on $Y_{i}$. In cases 1 and 2, the $Z_{i}$ follow a binomial distribution, in case 1 with probability 0.25 for all respondents, and in case 2 with probability 0.25 when $Y_{i}=0$ and 0.355 when $Y_{i}=1$. In cases 3 and 4, the $Z_{i}$ follow a multinomial distribution; in case 3 with probabilities $(0.2,0.2,0.2,0.2,0.1,0.1)$ for $Z_{i}=0,\dots,5$ for every respondent; in case 4 these probabilities apply when $Y_{i}=0$ and $(0.1,0.15,0.15,0.2,0.2,0.2)$ when $Y_{i}=1$. The multinomial probabilities are chosen so that they are not well represented by a binomial distribution.
Table 3 shows the results of the simulation for estimates of $\pi$. We consider maximum likelihood estimators under the four models listed in Appendix A, each under both the assumption that $Z_{i}$ and $Y_{i}$ are independent and the assumption that they are dependent, thus for a total of seven estimators (the multinomial logistic and ordinal logistic models are equivalent under independence). There were a handful of simulations where some estimators converged to a very small value (less than −10 on the logit scale). These may represent false convergence of the algorithm, so simulations where this happened for any estimator are excluded; there were no more than 19 instances of this in any of the four cases.
Table 3. Results of a simulation study of estimators of the probability of a sensitive item in item count data†
[Table entries not reproduced: for each of the seven estimators and simulation cases 1-4, the upper panel reports simulation means of the estimates of $\pi$ (scaled so that the true value is 100) and the lower panel the root-mean-squared errors, separately under the assumptions that $Z_{i}$ and $Y_{i}$ are independent and that they are dependent.]
The simulation means in the upper part of Table 3 show that all the estimators are approximately unbiased when the model for is correctly specified or overparameterized. When this model is incorrectly specified, however, the estimator of is biased, in many cases dramatically so. This happens when either the distribution of or the association between and is misspecified. It is worth noting that at least in these cases the one additional parameter of the beta–binomial model reduces the bias substantially relative to the binomial model when the true model is multinomial.
The main message of the lower part of Table 3 concerns the loss of efficiency when we must assume dependence between Z and Y. This can be seen by comparing the results of the dependence and independence models when the distribution is correctly specified. The loss of efficiency is small when the true distribution is binomial but much larger when it is multinomial. In the latter case the use of an ordinal model for the dependence improves the efficiency slightly but still leaves it far lower than that of the independence model.
4. Considerations on the design of item count questions
Careful design of the survey items is clearly a prerequisite for the success of the item count methodology. Here we consider briefly some elements of design (for more extensive discussions, see for example Glynn (2013), Blair and Imai (2012), and references cited therein). We focus on the technical questions of articulating the assumptions that the items should satisfy, how these affect the model specification and how they may be checked in the analysis. This is informative on, but does not directly answer, the central practical challenge of design, which is how we should choose the items to have a good chance of satisfying the assumptions. This question is not amenable to simple technical analysis or solutions, because the success of the exercise is ultimately dependent on how the survey respondents react to the questions. For understanding and predicting these reactions, the designer of an item count question will benefit from a good knowledge of the general theories and empirical evidence on the psychology of survey response (Tourangeau et al., 2000).
The item count technique, like any other randomized response method, is also likely to involve psychological peculiarities of its own. These relate to what might be termed the ‘weirdness factor’ of the method, which is created when the interviewer appears to the respondent to deviate from the implicit social contract of what a survey interview should involve. With a classical randomized response question this happens when the respondent is suddenly asked to do something like spin a dial to choose at random which question they should answer. In an item count question, the weirdness arises from being presented with a list of apparently disparate items with no indication of why the total of them might be of interest to the interviewer. The strangeness of this may be less than that of the dial spinning but it can still be non-negligible. It thus cannot be taken for granted that the respondents will react to an item count list exactly as intended, even if all the individual items are ostensibly simple and easy to answer.
Validity of item count measurement requires that the assumptions (a)–(d) stated at the beginning of Section 3.1 are satisfied. Of them, assumption (b) is satisfied by the randomization unless it is undermined by failure of assumption (d), i.e. differential non-response. The other assumptions cannot be guaranteed through design.
Assumption (a) of no lying is the motivation of the item count technique in the first place, in that it is designed to reduce reasons for lying by guaranteeing anonymity. This protection will fail completely in one situation: the ‘ceiling effect’ of a respondent in the treatment group whose truthful answer to all of the items would be affirmative, in which case a truthful total of J+1 would logically reveal the answer to the sensitive item. Direct evidence of the prevalence of this problem is given by the proportion of counts of J in the control group. In design, the aim should then be to select control items for which few respondents would give only affirmative answers. One way to achieve this is to use items which are individually rare. Another, which also reduces the chances of the ‘floor effect’ discussed below, is to choose a control set where some pairs of items are negatively correlated (Glynn, 2013).
Often discussed alongside the ceiling effect is the floor effect of a respondent in the treatment group whose truthful answers would be negative to all the control items but affirmative to the sensitive item. The argument is that such a person might judge that a truthful count of 1 would be known to correspond to the sensitive item. This, however, follows logically only if the interviewer can reasonably conclude from their observation of that respondent that his or her answer to all the control items must be no, a situation which should not be allowed to arise with a sensible set of control items. In other cases, concern with the floor effect involves a less compelling argument which requires reference to a population of other respondents, i.e. a judgement that the control items are such that most people's answers to all of them are likely to be negative. This does not necessarily follow even if all the items are individually rare.
A violation of the non-response assumption (d) can also lead to violation of assumption (c). Apart from that, assumption (c) is essentially a requirement of compliance, that the respondents actually respond to the question as stated and report a sum of the items rather than somehow react to the list as a whole (in which case it could matter whether the sensitive item was on it or not). Assuring this at the design stage is a considerable challenge. In the analysis, there is one obvious if partial way of examining the validity of the assumption. This is to use the logical conditions that P(S≤s|T=0)≥P(S≤s|T=1) for all s=0, …,J−1 and P(S≤s−1|T=0)≤P(S≤s|T=1) for all s=1, …,J (Blair and Imai, 2012). If these do not hold for all sample proportions, some of the moment estimates (14)–(15) of the probabilities of Z will be negative. This may occur because of sampling variation even when assumption (c) is satisfied, so a significance test for the conditions is needed; such a test was proposed by Blair and Imai (2012). A test result that does not detect a significant violation is not sufficient evidence that the assumption is satisfied, but a significant test result does provide strong evidence that it is not. Furthermore, apart from sensitivity analysis of the possible biases there is nothing that can really be done in the analysis to adjust for such a violation. The conclusion from this part of the analysis may thus be the disheartening one that an item count question is irretrievably flawed.
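The logical conditions on the cumulative distributions of the reported totals can be checked directly from the sample proportions. The sketch below implements the two families of inequalities; the function name and interface are our own illustration, not from Blair and Imai (2012):

```python
import numpy as np

def check_compliance_conditions(s_control, s_treat, J=5):
    """Sample analogues of the conditions P(S<=s|T=0) >= P(S<=s|T=1)
    for s = 0, ..., J-1 and P(S<=s-1|T=0) <= P(S<=s|T=1) for
    s = 1, ..., J.  Returns the values of s at which each family of
    inequalities fails in the sample."""
    # Empirical cumulative proportions in the control (0..J) and
    # treatment (0..J+1) groups.
    F0 = np.array([(np.asarray(s_control) <= s).mean() for s in range(J + 1)])
    F1 = np.array([(np.asarray(s_treat) <= s).mean() for s in range(J + 2)])
    fail_upper = [s for s in range(J) if F0[s] < F1[s]]
    fail_lower = [s for s in range(1, J + 1) if F0[s - 1] > F1[s]]
    return fail_upper, fail_lower
```

A nonempty result corresponds to negative moment estimates of some probabilities of Z; as noted above, a formal significance test is still needed before concluding that assumption (c) fails.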
Another element of validity is that of the model specification, especially of the model for the control items as we argued in Section 3.2. Even when correct, this model also affects the efficiency of the analysis, i.e. the extent of missing information that reduces the precision of the estimates of the parameters of interest. This depends on the complexity of the model for the control items, most of all on whether or not they are conditionally independent of the sensitive item. This conditional independence should be one aim of the design of an item count question. It should not be impossible to achieve when the control items are unrelated in content to the sensitive item, but it cannot be guaranteed in advance.
These considerations suggest that the ideal set of control items would be one for which every respondent would report the same count z with 0<z<J, achieved in such a way that the z would not be the same items for everyone (to avoid a version of the floor effect). Such items would both satisfy all the assumptions for validity and maximize the efficiency of the estimates. They are of course unachievable in practice but worth keeping in mind as a general aim.
The design of an item count question is likely to involve trade-offs between different aims. For example, a list of control items which are independent of the sensitive item and negatively correlated with each other may seem particularly odd to the respondent and thus increase the risk of non-compliance. Chaudhuri and Christofides (2007) emphasized this danger and recommended instead choosing sensitive and control items which are thematically related—which will then mean that they are unlikely to be statistically independent.
How then do the Euro-Justis item count questions measure up against these criteria? When the questions were designed, there was a conscious attempt to avoid very common items and some aim to include items which would be weakly or even negatively correlated with each other. To try to reduce the weirdness factor, the list consists of relatively general and not strikingly peculiar inquiries. Furthermore, item 5, which asks whether the respondent has been a victim of crime, was included to try to make the list seem a little less out of place in a survey that was otherwise mostly about crime and criminal justice. This, however, may have had the consequence of introducing an association between the sensitive item and the total of the control items. It is clear from the analysis below that there is indeed such an association, and it may be because victimization and criminal behaviour tend to be correlated.
The survey organizations in the three countries of the Euro-Justis survey each produced a field report where the interviewers summarized their own experiences and common reactions by the respondents. It is encouraging to note that none of the three reports mentioned any concerns about the item count question. This contrasts with comments, which came consistently from all the countries, that many respondents had reacted negatively to direct questions elsewhere in the survey on other sensitive topics such as family income and attitudes towards crime.
Only 5.6% of the respondents in the control group and 4.7% in the treatment group refused to answer the item count question, so non-response is unlikely to be a major source of bias. The potential for ceiling effects is also minimal, as only 21 of 1206 respondents in the control group gave the maximum value of 5. Less reassuring is the finding that two of the cumulative proportions in the observed sample are inconsistent. These are the proportions for counts of 0 and 2, where the cumulative probability is smaller in the control than in the treatment group (0.223 versus 0.230 and 0.828 versus 0.830). These differences are not statistically significant, so they may be due to sampling variation. Nevertheless, they give some reason to worry that assumption (c) of consistency of responses may be violated.
5. Item count estimates of criminal behaviour
Table 4 shows estimates for models without explanatory variables for the item count question in the Euro-Justis survey. The quantity of main interest is the probability of the sensitive item Y—having bought stolen goods in the past year. Table 4 also includes estimates of the probabilities of different counts Z for the five control items. Here the focus is on how the estimates of the probability of Y are affected by different choices for the model for the control items. We consider the moment-based estimators given by equations (3), (14) and (15), as well as maximum likelihood estimators with each of the four models for Z that are listed in Appendix A, each of the latter both with and without the assumption that Z and Y are independent. For assessment and comparison of model fits, Table 4 includes the Akaike information criterion statistic AIC = −2 log L̂ + 2q, where L̂ is the maximized likelihood and q is the number of estimated parameters in a model, and the p-value for a χ²-test of goodness of fit which compares the fitted counts for reported totals S from each model with the observed counts that are shown in Table 2.
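Assuming that equation (3) is the usual difference-in-means moment estimator for item count data, it amounts to no more than the following:

```python
import numpy as np

def moment_estimate(s_treat, s_control):
    """Difference-in-means moment estimate of P(Y = 1): the mean
    reported total in the treatment group minus the mean total in
    the control group.  Unbiased under assumptions (a)-(d)."""
    return np.mean(s_treat) - np.mean(s_control)
```

For example, with hypothetical totals `[1, 1, 2, 2, 4]` in the treatment group and `[0, 1, 1, 2, 3]` in the control group the estimate is 2.0 − 1.4 = 0.6.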
Table 4. Probabilities of buying stolen goods and of different counts Z for the five control items, estimated from the item count question in the Euro-Justis survey†
|Moment||Y=0||0.24||0.39||0.22||0.11||0.04||0.01|
|Multinomial||Y=0||0.24||0.39||0.22||0.11||0.04||0.01|
|Ordinal||Y=0||0.23||0.39||0.23||0.11||0.04||0.01|
|Beta–binomial||Y=0||0.25||0.38||0.25||0.10||0.02||0.00|
|Binomial||Y=0||0.25||0.40||0.26||0.08||0.01||0.00|
[Remaining columns of Table 4 not reproduced.]
As discussed in Section 'Considerations on the design of item count questions', two of the moment-based estimates of the probabilities of Z are negative; these probabilities have boundary estimates of 0 in the saturated multinomial logistic model where Y and Z are dependent. For each model for Z, the hypothesis of independence between Z and Y is clearly rejected, and there is a substantial difference between the estimated probabilities of Z conditional on the two values of Y. The probabilities given Y=0 are generally similar to those obtained when Z and Y are assumed independent, whereas the estimated probabilities given Y=1 are much less stable.
For these data, the estimated proportion of people who have bought stolen goods is fairly sensitive to the model assumed for the control items. Point estimates of the proportion are 0.12–0.14 when the clearly inappropriate assumption of independence between Z and Y is made, but 0.02–0.09 without it. Different dependence models also produce rather different results. The binomial and beta–binomial models give the higher estimates of 0.06–0.09. The goodness of fit of these two models is inadequate according to the χ²-test even when Z and Y are dependent. Only the multinomial models where Z depends on Y yield a good fit, both with a multinomial logistic and with an ordinal logistic model for the dependence. The estimated probability is then 0.041 with the multinomial and 0.017 with the ordinal model.
The differences between these estimates appear somewhat less dramatic when we acknowledge also the uncertainty in them, as revealed by the estimated standard errors shown in Table 4. The 95% confidence intervals for the probability (derived from intervals on the logit scale) are (0.016, 0.103), (0.005, 0.058), (0.037, 0.108) and (0.064, 0.117) for the multinomial, ordinal, beta–binomial and binomial dependence models respectively, so there is substantial overlap between most of them. This overlap would increase further if we tried to allow for misspecification of the models for the control items by using, say, sandwich-type estimators of the standard errors. The same would be true to a smaller extent for results obtained with independence models for the control items, but there the primary conclusion must still be that these models would lead to non-trivially different conclusions about the quantity of interest.
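A back-transformation of this kind, assuming the intervals are Wald intervals computed on the logit scale, can be sketched as follows:

```python
import math

def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))

def logit_ci_to_prob(est_logit, se_logit, z=1.96):
    """Transform a Wald interval, estimate +/- z * SE on the logit
    scale, into an interval for the probability itself.  The
    resulting interval is asymmetric around the point estimate but
    always lies inside (0, 1)."""
    return (expit(est_logit - z * se_logit),
            expit(est_logit + z * se_logit))
```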
In Table 5 we turn to regression modelling of the item count question. The first analyses have given us some reason to approach this with caution, as the apparent association between the sensitive and control items will reduce the available information, and as there is some indication of non-compliance by the respondents. However, it is still of interest to see what the item count question can reveal about associations between explanatory variables and self-reported criminal behaviour.
Table 5. Regression models for the item count question in the Euro-Justis survey†
| Model for sensitive item Y (buying stolen goods) |
|Constant||−2.97|| ||−3.11|| ||−3.82|| ||−4.72|| |
|Age||−0.01||(0.01)||−0.01||(0.01)|| || || || |
|Morality × need|| || || || || || ||−2.58||(2.16)|
| Model for the total Z of the control items |
|Morality||−0.17||(0.16)||−0.28||(0.32)|| || || || |
|Y|| || ||0.99||(0.40)||1.11||(0.42)||0.99||(0.38)|
[Remaining rows and entries of Table 5 not reproduced.]
We consider two substantively interesting explanatory variables: the respondents’ judgement of the moral acceptability of buying stolen goods and their self-reported economic circumstances. The motivation and definitions of these variables were given in Section 'Using survey data to model criminal behaviour'. The respondent's age in years is also included as a control variable. For the control items we use an ordinal logistic model. This provides flexibility for the choice of the distribution of Z by treating it as multinomial, but is more parsimonious than a multinomial logistic model in how the effects of the explanatory variables are specified.
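Under a proportional-odds (ordinal logistic) specification, the category probabilities for Z are obtained by differencing cumulative logits that share a single linear predictor. A minimal sketch, with hypothetical cut-points chosen only for illustration:

```python
import numpy as np

def expit(x):
    """Inverse of the logit function (vectorized)."""
    return 1.0 / (1.0 + np.exp(-x))

def ordinal_logit_probs(cutpoints, eta):
    """Category probabilities P(Z = k), k = 0, ..., K, under a
    proportional-odds model: P(Z <= k) = expit(c_k - eta), with the
    same linear predictor eta shifting every cumulative logit.
    `cutpoints` must be increasing, of length K."""
    cum = np.append(expit(np.asarray(cutpoints) - eta), 1.0)
    return np.diff(cum, prepend=0.0)
```

With five cut-points this yields six category probabilities, as needed for a count on 0, …, 5; each covariate contributes one coefficient to eta rather than one per category, which is the parsimony referred to above.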
Estimates for four models are shown in Table 5. Models 1 and 2 include all three explanatory variables in the models for both Y and Z. They differ in that in model 1 the outcomes Y and Z are conditionally independent given the predictors, whereas in model 2 Z depends also on the main effect of Y as an additional explanatory variable (the model with interactions between Y and the other predictors was not significantly different from this). The difference between these models is statistically significant, so the previous conclusion that Y is significantly associated with Z holds even after controlling for the three explanatory variables. A comparison of the estimated coefficients and their standard errors between these two models shows clearly that conclusions about the model of interest may be strongly affected by whether or not an association between Z and Y is included, and that uncertainties are substantially increased if it needs to be included.
In model 3 we remove two explanatory variables from model 2: age from the model for Y and moral judgement from the model for Z. Neither is significant in model 2, and for the morality variable there is also no substantive motivation for considering it as a predictor for the control items. In this model, the associations involving Z are strongly significant and substantively sensible. They indicate that older people and people who are struggling on their present income tend to have engaged in fewer of the activities on the control list. The effect of Y is that people who have bought stolen goods tend to report higher totals for the control items. As discussed in Section 'Considerations on the design of item count questions', a possible substantive explanation of this involves the control item on having been a victim of crime.
The model of interest in model 3 includes moral judgement and financial need as explanatory variables. These are taken to represent aspects of normative and instrumental motivations of criminal behaviour respectively. Their estimated effects are in the expected directions: respondents who have a higher need are more likely to have bought stolen goods, as are people who do not regard such action as morally wrong. The coefficient of moral judgement is not significant, but that of financial need is. For at least one explanatory variable the item count has thus provided enough information for us to be able to detect a positive association between it and the sensitive item, separate from its (negative) association with the control items.
Finally, in model 4 we examine the last substantive hypothesis that was discussed in Section 'Using survey data to model criminal behaviour': that of an interaction between morality and need. Its point estimate is negative. This would be an intriguing conclusion in that it would suggest that moral judgement makes a difference only when need is low, and need makes a difference only when an act is judged immoral in general—which would be the exact opposite of the hypothesis that was proposed by Kroneberg et al. (2010). However, the interaction is clearly not significant so no firm conclusions should be drawn. It is apparent that estimating such an interaction is beyond what the information from these item count data can reliably support.
The item count method is a valuable and increasingly commonly used addition to the methodology of asking sensitive survey questions. It has some definite advantages over both direct questioning and other randomized response methods. Statistical analysis of item count data can be implemented elegantly and efficiently with methods for categorical data analysis for incomplete data. We illustrated this in our application, where substantively plausible models for illegal behaviour were obtained.
The method also has its disadvantages and peculiar methodological challenges. Most of these stem from the distinctive feature of an item count question, which is the list of the control items. Even though these items are of no direct substantive interest themselves, careful attention must be paid to them so as not to compromise information about the sensitive item of interest. We have argued that at the analysis stage sufficiently flexible model specification for the total of the control items is crucial, in particular that it should usually be treated as multinomially distributed.
Most of the effort and ingenuity in the design of an item count question should also be devoted to the control items. For validity, responses to them should neither be affected by the presence of the sensitive item on the list nor give respondents reasons to lie about it; for efficiency, the control items should ideally be independent of the sensitive item. At the design stage it is not easy to be confident that these conditions will be satisfied. At the analysis stage, failures of them cannot always be detected and, even when they can, are typically not correctable.
All of this poses a substantial challenge for designers of surveys on sensitive topics, and one which will no doubt generate much future research. One practical recommendation that it suggests is that we should aim to build up a body of knowledge about specific item count questions, so that control items which have been found to work well in the past could be used again, even in item count surveys of different sensitive topics.
The Euro-Justis project (full title ‘Scientific indicators of confidence in justice: tools for policy assessment’) was funded by the seventh framework programme of the European Commission under grant agreement 217311. We are grateful to Daniel Oberski for introducing the item count method to us, and to Ben Lauderdale for helpful comments. We thank the Joint Editor, the Guest Associate Editor and a referee for their comments and suggestions.