Misreporting and econometric modelling of zeros in survey data on social bads: An application to cannabis consumption

Abstract
When modelling “social bads,” such as illegal drug consumption, researchers are often faced with a dependent variable characterised by a large number of zero observations. Building on the recent literature on hurdle and double-hurdle models, we propose a double-inflated modelling framework, where the zero observations are allowed to come from the following: nonparticipants; participant misreporters (who have larger loss functions associated with a truthful response); and infrequent consumers. Due to our empirical application, the model is derived for the case of an ordered discrete-dependent variable. However, it is similarly possible to augment other such zero-inflated models (e.g., zero-inflated count models, and double-hurdle models for continuous variables). The model is then applied to a consumer choice problem of cannabis consumption. We estimate that 17% of the reported zeros in the cannabis survey are from individuals who misreport their participation, 11% from infrequent users, and only 72% from true nonparticipants.


INTRODUCTION AND BACKGROUND
Recreational drug use is one of the major social problems that policymakers face across the world, being associated with crime, violence, and more importantly, adverse health consequences. To this end, there exists a vast literature concerned with the empirical modelling of a wide array of data on drug use to help inform policymakers. Research on drug use and its consequences has arisen from several disciplines. Drug users' behaviour has been studied extensively in psychology, sociology, and medical arenas. In the last few decades, economists have also shown a growing interest in the study of drug use and its consequences. They have brought unique and useful perspectives to the understanding of drug users' behaviour, the onset of drug use, and abuse prevention, all of which have made important contributions to the drug policy debate. Crucial to effective policy analysis is the scope and quality of the drug data.
The availability of individual/household-level drug data has brought an improved understanding of consumer behaviour. The analysis of differential policy responses by demographic characteristics, such as age, gender, and ethnicity, has been very useful for the development of drug policies and other educational programmes. Such data are invariably collected using survey techniques. The accuracy of the information gathered is dependent on reliable and accurate responses by the respondents. It may be the case, however, that respondents have an incentive to misreport their drug consumption given that drugs, in particular those of an illicit nature, are associated with legal risks, as well as stigma, or social disapproval. Such misreporting may lead to information being misclassified in survey data, which can mask the incidence of such behaviours and lead to biased and inconsistent estimates in statistical analyses (Hausman, Abrevaya, & Scott-Morton, 1998). With an increasing use of survey data for drug policy analysis, it is therefore crucial to explore the incidence and implications of misreporting.
Although survey misclassification or misreporting (in the case of drugs and in general) is known to be pervasive and to potentially bias statistical/econometric analyses, there is only a limited body of research that has explicitly modelled such behaviours. The relevant studies on misclassified dependent variables include Hausman et al. (1998), Abrevaya and Hausman (1999), Lewbel (2000), and Dustmann and van Soest (2001). Hausman et al. (1998) use a parametric approach to estimate misclassification probabilities when the functional form of the distribution of the error term is known. They consider a binary choice model with two types of misclassification: the probability that a true 0 is recorded as a 1, and vice versa. Abrevaya and Hausman (1999) and Lewbel (2000), on the other hand, consider a semiparametric approach to estimate the misclassification probabilities, where the distribution of the error term is unknown; Lewbel (2000) additionally allows the misclassification probabilities to be covariate-dependent functions. Dustmann and van Soest (2001) and, more recently, Greene, Harris, and Hollingsworth (2015) extend the parametric model of Hausman et al. (1998) to ordered data.
More recently, researchers such as Mahajan (2006), Hu (2008), and Molinari (2008) have attempted to model misclassification in discrete-dependent variables using a secondary measurement or an instrument to identify a nonlinear model. Their approach is based on the observation that in the presence of classification errors, the relation between the true variable and its misclassified representation is given by a linear system of simultaneous equations in which the coefficient matrix is the matrix of misclassification probabilities. The so-called anchoring, focal point answers, and crude rounding in surveys have also been a subject of interest among researchers (Kleinjans & van Soest, 2014; Manski & Molinari, 2010; van Soest & Hurd, 2012). Using a random-effect multinomial logit model, Kleinjans and van Soest (2014) explicitly account for such reporting behaviour including nonresponse where respondents decide not to report any value. Finally, anchoring vignettes have also been used to measure discrepancies in reporting behaviours, particularly in the case of self-reported health and life satisfaction (Kristensen & Johansson, 2008; van Soest, Delaney, Harmon, Kapteyn, & Smith, 2011); however, vignettes are very rare in most mainstream data sets.
This paper makes an important contribution to this received literature. The main aim is to develop a latent-class or partial observability-type modelling approach to analyse the extent of misreporting in drug consumption information collected using a large national drug survey. Our particular interest lies in misreporting in the context of cannabis consumption. However, the technique will also be generally applicable to empirical health models, such as sexual health and mental health, that involve sensitive responses and where there is a potential for inaccurate measurement of the variable(s) of interest. Essentially, we assume that for a "sensitive" response variable (such as drug use), there is an associated loss-function (either perceived or actual, social and/or legal) involved for the individual in terms of the responses he/she reports. Here, it is clear that the researcher must be aware of the potential for misreporting. For example, there will be a strong incentive for individuals to misreport (presumably under-report) their true consumption levels for fear of legal (and/or moral) repercussions (see, e.g., Pudney, 2007). This typically gives rise to a preponderance of "zero" observations in the data set.
There is a suite of hurdle and double-hurdle models that have been developed over the years to address the build-up of "zero" observations where the response variable is either a continuous variable with a nonzero probability mass at (typically, but not exclusively) zero levels (Cragg, 1971; Jones, 1989; Smith, 2003), or a count variable (Greene, 1994; Heilbron, 1989; Lambert, 1992; Mullahy, 1986, 1997; Pohlmeier & Ulrich, 1995), or an ordered discrete variable (Harris & Zhao, 2007). For example, in the zero-inflated ordered probit (ZIOP) model, Harris and Zhao (2007) consider ordered levels of tobacco consumption and argue that the reported zeros could arise from both nonparticipants and infrequent consumers. In the same spirit, following the standard double-hurdle arguments, we suggest that the build-up of "zero" observations may correspond to both nonparticipants and participant (but infrequent) consumers. However, for these "social bads" with associated reporting loss-functions, we also suggest a third source, involving those participants who, potentially due to fear of repercussions, report zero-consumption when this in fact is not so.
This concept can be applied to the range of models mentioned above that exhibit a preponderance of zeros, such as the zero-inflated Poisson (ZIP) and other double-hurdle models. Here, in view of our application to illicit drug use recorded on an ordinal scale, we focus on a ZIOP model, although the techniques can be similarly applied to other statistical models. Explicitly, we propose a three-tiered approach: the first equation determines the participation decision; the second, conditional on participation, determines whether an individual misreports; and finally, the third, for participants who report accurately, is an ordered probit model that determines the levels of consumption, which also include zero consumption of infrequent users. We term this generalisation of the ZIOP model the double ZIOP (DZIOP) model. In research in areas of discrete random variables that are inherently ordered, misreporting has sometimes been addressed by allowing the model's inherent boundary parameters to vary by observed personal characteristics (Greene & Hensher, 2010). Here, in addition to the "fundamental" form of modelling misreporting, we can also allow for more general under- (or over-) reporting, by allowing the boundary parameters to vary by observed characteristics.

A DZIOP model
Our approach entails a fundamental form of modelling the misreporting, which is likely to be present in data that are perceived to embody a strong loss-function (social and/or legal) for the individual. Following the ZIOP model of Harris and Zhao (2007), we start by defining a discrete random variable y that is observable and assumes the discrete ordered values of 0, 1, … , J. A standard OP approach would map a single latent variable to the observed outcome y via so-called boundary parameters, with the latent variable being related to a set of covariates (Greene & Hensher, 2010). However, the ZIOP model involves two latent equations: a probit selection equation and an OP equation. As with the more traditional double-hurdle models (Jones, 1989), individuals here have to overcome two hurdles before one observes nonzero consumption: whether to participate, and then, conditional on participation, the amount to consume, which also includes zero consumption.
However, it is our contention here that, especially regarding the consumption of "social bads" (e.g., licit and, in particular, illicit drugs), participants may intentionally misreport their true consumption patterns. In particular, we hypothesise that a (probably significantly large) proportion of participants may under-report their true consumption levels by simply stating zero consumption. That is, we contend that, for example, if a user is concerned with legal ramifications of admitting drug use, he/she will typically prefer to misreport "none at all," as compared to simply under-reporting their true use. The alternative assumption would be akin to someone feeling more comfortable admitting to breaking the law "by just a little." Finally, participants who do not misreport, as with the ZIOP model, are free to select any of the j = 0, … , J outcomes. In this way, observed zero-consumption can arise from the following: (a) nonparticipants; (b) participants who misreport; or (c) participants who do not misreport, but who are infrequent consumers (e.g., who happen to not have used drugs in the past 12 months). Thus, as compared to, say, a standard OP approach, the zero observations are "double-inflated": once by nonparticipants and then by misreporters. We suggest a three-tiered sequencing of decision making. First, the individual makes a decision whether to participate or not; secondly, there is the decision to misreport or not; and finally, there is the decision on how much to consume.
Following Harris and Zhao (2007), we let $r$ denote a binary variable indicating the split between Regime 0 (with $r = 0$ for nonparticipants) and Regime 1 (with $r = 1$ for participants). Although unobservable, $r$ is related to a latent variable $r^*$ via the mapping: $r = 1$ for $r^* > 0$ and $r = 0$ for $r^* \le 0$. The variable $r^*$ represents the propensity for participation. It is related to a set of explanatory variables ($x_r$) with unknown weights $\beta_r$ and a standard normally distributed error term $\varepsilon_r$ such that
$$r^* = x_r'\beta_r + \varepsilon_r.$$
A second latent variable $m^*$ governs the decision to misreport. Again, this is related to a second unobserved variable $m$ such that $m = 1$ for $m^* > 0$ and $m = 0$ for $m^* \le 0$, where $m = 0$ represents a misreporter and $m = 1$ a truthful reporter. Again, we can write this in linear latent form as
$$m^* = x_m'\beta_m + \varepsilon_m.$$
Finally, consumption levels under Regime 1 are represented by a discrete variable $\tilde{y}$ ($\tilde{y} = 0, 1, \dots, J$) generated by an OP model via a third latent variable $\tilde{y}^*$ such that
$$\tilde{y}^* = x_y'\beta_y + \varepsilon_y,$$
with the standard mapping of
$$\tilde{y} = \begin{cases} 0 & \text{if } \tilde{y}^* \le 0, \\ j & \text{if } \mu_{j-1} < \tilde{y}^* \le \mu_j \quad (j = 1, \dots, J-1), \\ J & \text{if } \mu_{J-1} < \tilde{y}^*, \end{cases}$$
where $\mu$ is a vector of boundary parameters to be estimated (the extreme values, $\mu_0$ and $\mu_J$, are normalised at 0 and $+\infty$, respectively). Of course, as with the ZIOP model, $\tilde{y}$ is not directly observed. Nor is either $r$ or $m$. Here, the observability criterion for observed $y$ is
$$y = r \times m \times \tilde{y}.$$
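To make the observability criterion concrete, the following is a minimal simulation sketch of the three latent processes with independent standard normal errors. It is an illustration only, not the authors' code: the function name `simulate_dziop`, the choice to hold the three linear indices fixed across individuals, and all numerical values are assumptions made for the example.

```python
import random

def simulate_dziop(n, xb_r, xb_m, xb_y, mu, seed=0):
    """Draw n values of y = r * m * y_tilde for fixed index values.

    xb_r, xb_m, xb_y : linear indices x'beta of the participation, reporting
                       and consumption equations (held constant here, which
                       is a simplification for illustration)
    mu               : interior boundary parameters mu_1 < ... < mu_{J-1};
                       mu_0 = 0 is imposed inside the function
    """
    rng = random.Random(seed)
    cuts = [0.0] + list(mu)                      # mu_0, mu_1, ..., mu_{J-1}
    ys = []
    for _ in range(n):
        r = 1 if xb_r + rng.gauss(0, 1) > 0 else 0   # participant?
        m = 1 if xb_m + rng.gauss(0, 1) > 0 else 0   # truthful reporter?
        y_star = xb_y + rng.gauss(0, 1)              # latent consumption
        y_tilde = sum(y_star > c for c in cuts)      # number of boundaries exceeded
        ys.append(r * m * y_tilde)                   # observed outcome
    return ys

# Hypothetical index values and boundaries for a four-category outcome (J = 3)
sample = simulate_dziop(20000, xb_r=0.0, xb_m=0.5, xb_y=0.2, mu=[0.8, 1.6])
zero_share = sample.count(0) / len(sample)
```

In a draw like this, the zero category pools nonparticipants, misreporters, and truthful participants with zero consumption, which is exactly the pooling the model is designed to disentangle.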
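An observed $y = 0$ outcome can arise from three distinct sources: $r = 0$ (the individual is a nonparticipant); $r = 1$ (the individual is a participant) and jointly that $m = 0$ (the individual is a misreporter); and finally, that jointly $r = 1$, $m = 1$, and $\tilde{y} = 0$ (the individual is a zero consumption, accurate-reporting participant). In the same way, to observe a positive $y$, we require jointly that the individual is a participant ($r = 1$) and an accurate reporter ($m = 1$) and that $\tilde{y}^* > 0$. This setup is one of partial observability in line with models proposed by Poirier (1980) and Meng and Schmidt (1985). For the time being, assume that the stochastic terms $\varepsilon = (\varepsilon_r, \varepsilon_m, \varepsilon_y)$ are independent and follow standard Gaussian distributions. The full probability for $y = 0$ is given by
$$\Pr(y = 0 \mid x) = \Pr(r = 0 \mid x_r) + \Pr(r = 1, m = 0 \mid x_r, x_m) + \Pr(r = 1, m = 1, \tilde{y} = 0 \mid x_r, x_m, x_y)$$
and for the remaining outcomes
$$\Pr(y = j \mid x) = \Pr(r = 1, m = 1, \tilde{y} = j \mid x_r, x_m, x_y) \quad (j = 1, \dots, J).$$
By independence, these joint probabilities are simply products of the marginals such that, under the usual assumption of normality, they are given respectively by
$$\Pr(y = 0 \mid x) = [1 - \Phi(x_r'\beta_r)] + \Phi(x_r'\beta_r)[1 - \Phi(x_m'\beta_m)] + \Phi(x_r'\beta_r)\Phi(x_m'\beta_m)\Phi(-x_y'\beta_y)$$
and
$$\Pr(y = j \mid x) = \Phi(x_r'\beta_r)\Phi(x_m'\beta_m)[\Phi(\mu_j - x_y'\beta_y) - \Phi(\mu_{j-1} - x_y'\beta_y)] \quad (j = 1, \dots, J-1),$$
$$\Pr(y = J \mid x) = \Phi(x_r'\beta_r)\Phi(x_m'\beta_m)[1 - \Phi(\mu_{J-1} - x_y'\beta_y)]. \tag{8}$$
The probability of a zero observation has been "double-inflated" as it is a combination of the probability of "zero consumption" from the OP process and the probability of "nonparticipation" from the split probit model plus that from misreporting. Note that as per the ZIOP model, there may or may not be overlaps with the variables in the partitions in $x_r$, $x_m$, and $x_y$, although undoubtedly identification will be aided when the partitions do not fully overlap. Given the assumed form for the probabilities and an i.i.d. sample of size $N$ from the population on $(y_i, x_i)$, $i = 1, \dots, N$, the parameters of the full model $\theta = (\beta_r', \beta_m', \beta_y', \mu')'$ can be consistently and efficiently estimated using maximum likelihood techniques. The log-likelihood function is
$$\ell(\theta) = \sum_{i=1}^{N} \sum_{j=0}^{J} h_{ij} \ln \Pr(y_i = j \mid x_i, \theta), \tag{9}$$
where the indicator function $h_{ij}$ is
$$h_{ij} = \begin{cases} 1 & \text{if individual } i \text{ chooses outcome } j, \\ 0 & \text{otherwise.} \end{cases}$$
Maximization was performed using the Broyden-Fletcher-Goldfarb-Shanno algorithm (Greene, 2008). Robust standard errors were computed based on the common "sandwich" estimator.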
Standard errors of secondary estimated quantities, such as partial effects and summary probabilities, were estimated using the Delta method. Clearly, to apply a similar set-up to count or continuous dependent variables, one could simply replace the OP densities above with the appropriate ones for the data at hand.
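The outcome probabilities under independent errors, and the log-likelihood built from them, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the estimation code used in the paper: the helper names (`Phi`, `dziop_probs`, `log_likelihood`) are hypothetical, the linear indices are passed in precomputed, and no optimiser or standard-error calculation is included.

```python
import math

def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dziop_probs(xb_r, xb_m, xb_y, mu):
    """Return [Pr(y=0), ..., Pr(y=J)] for one individual, independent errors.

    mu holds the interior boundaries mu_1..mu_{J-1}; mu_0 = 0 is imposed.
    """
    p_r, p_m = Phi(xb_r), Phi(xb_m)
    cuts = [0.0] + list(mu)
    # Pr(y_tilde <= j) for j = 0..J-1, then 1 for j = J
    cdf = [Phi(c - xb_y) for c in cuts] + [1.0]
    op = [cdf[0]] + [cdf[j] - cdf[j - 1] for j in range(1, len(cdf))]
    probs = [p_r * p_m * p for p in op]          # truthful participants
    probs[0] += (1.0 - p_r) + p_r * (1.0 - p_m)  # nonparticipants + misreporters
    return probs

def log_likelihood(y, xb_r, xb_m, xb_y, mu):
    """Sum of log Pr(y_i = observed outcome) over the sample."""
    return sum(
        math.log(dziop_probs(a, b, c, mu)[yi])
        for yi, a, b, c in zip(y, xb_r, xb_m, xb_y)
    )

# Hypothetical index values; the probabilities sum to one by construction
p = dziop_probs(0.0, 0.5, 0.2, [0.8, 1.6])
```

In practice the indices would be computed as products of covariates and parameters, and the negative of `log_likelihood` would be handed to a quasi-Newton optimiser.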

Generalising the model to correlated disturbances (DZIOPC)
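As described above, the observed realisation of the random variable $y$ can be viewed as the result of three separate latent processes with uncorrelated error terms. However, these three outcomes correspond to the same individual, so it is likely that the vector of stochastic terms $\varepsilon_i$ will be related across equations. We now extend the model to have $(\varepsilon_r, \varepsilon_m, \varepsilon_y)$ follow a multivariate normal distribution with covariance matrix $\Omega_3$, whilst maintaining the usual probit normalisations of unit variances, so that
$$\Omega_3 = \begin{pmatrix} 1 & \rho_{r,m} & \rho_{r,\tilde{y}} \\ \rho_{r,m} & 1 & \rho_{m,\tilde{y}} \\ \rho_{r,\tilde{y}} & \rho_{m,\tilde{y}} & 1 \end{pmatrix}.$$
The full observability criteria are thus
$$y = r m \tilde{y} = \begin{cases} 0 & \text{if } (r^* \le 0) \text{ or } (r^* > 0 \text{ and } m^* \le 0) \text{ or } (r^* > 0 \text{ and } m^* > 0 \text{ and } \tilde{y}^* \le 0), \\ j & \text{if } (r^* > 0 \text{ and } m^* > 0 \text{ and } \mu_{j-1} < \tilde{y}^* \le \mu_j) \quad (j = 1, \dots, J-1), \\ J & \text{if } (r^* > 0 \text{ and } m^* > 0 \text{ and } \mu_{J-1} < \tilde{y}^*), \end{cases}$$
which translate into the following expressions for the probabilities
$$\Pr(y = 0 \mid x) = [1 - \Phi(x_r'\beta_r)] + [\Phi(x_r'\beta_r) - \Phi_2(x_r'\beta_r, x_m'\beta_m; \Omega_2)] + \Phi_3(x_r'\beta_r, x_m'\beta_m, -x_y'\beta_y; \Omega_3^*),$$
$$\Pr(y = j \mid x) = \Phi_3(x_r'\beta_r, x_m'\beta_m, \mu_j - x_y'\beta_y; \Omega_3^*) - \Phi_3(x_r'\beta_r, x_m'\beta_m, \mu_{j-1} - x_y'\beta_y; \Omega_3^*) \quad (j = 1, \dots, J-1),$$
$$\Pr(y = J \mid x) = \Phi_2(x_r'\beta_r, x_m'\beta_m; \Omega_2) - \Phi_3(x_r'\beta_r, x_m'\beta_m, \mu_{J-1} - x_y'\beta_y; \Omega_3^*), \tag{12}$$
where $\Phi_3(.)$ and $\Phi_2(.)$, respectively, denote the c.d.f. of the standardised trivariate and bivariate normal distribution and where $\Omega_2$ is the relevant submatrix of the full $\Omega_3$ matrix (that of $\varepsilon_r$ and $\varepsilon_m$). In this reconstruction, $\Omega_3^*$ denotes $\Omega_3$ with the signs of $\rho_{r,\tilde{y}}$ and $\rho_{m,\tilde{y}}$ reversed, because the trivariate terms are probabilities for $(-\varepsilon_r, -\varepsilon_m, \varepsilon_y)$. Maximum likelihood estimation would again involve maximization of Equation (9), replacing the probabilities of (8) with those of (12) and redefining $\theta$ as $\theta = (\beta_r', \beta_m', \beta_y', \mu', \rho_{r,m}, \rho_{r,\tilde{y}}, \rho_{m,\tilde{y}})'$. A test of the joint null hypothesis $\rho_{r,m} = \rho_{r,\tilde{y}} = \rho_{m,\tilde{y}} = 0$ is a joint test for independence of the three error terms and thus a test of the more general model given by Equation (12) against the null of a simpler nested model of Equation (8).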
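One common way to carry out such a joint test of nested models is a likelihood ratio test, which the results section also refers to. The sketch below assumes the test has three degrees of freedom (one per correlation coefficient) and uses the closed-form upper-tail probability of a chi-square variate with three degrees of freedom. The function names and the two log-likelihood values are hypothetical and are not taken from the paper.

```python
import math

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variate with 3 degrees of freedom."""
    if x <= 0:
        return 1.0
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

def lr_test(loglik_restricted, loglik_unrestricted):
    """Likelihood ratio statistic and p-value for three restrictions.

    loglik_restricted   : maximised log-likelihood with independent errors
    loglik_unrestricted : maximised log-likelihood with correlated errors
    """
    stat = 2.0 * (loglik_unrestricted - loglik_restricted)
    return stat, chi2_sf_3df(stat)

# Hypothetical maximised log-likelihood values, for illustration only
stat, p_value = lr_test(loglik_restricted=-100.0, loglik_unrestricted=-97.0)
```

A p-value above the chosen significance level would mean the independence restrictions are not rejected, in which case the simpler model may be preferred on grounds of parsimony.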

AN APPLICATION TO CANNABIS CONSUMPTION
Cannabis use imposes a high social and economic cost on society and has been a major concern to policy-makers worldwide. It is the most commonly used drug after tobacco and alcohol, particularly in the younger population. A large amount of public funds have flowed into promotional campaigns and rehabilitation programmes in many countries across the world in order to treat and prevent cannabis-related harm. This has resulted in a growing importance of research in order to develop sound policies and strategies. The quality of the evidence from these scientific investigations is an important concern, however. Because cannabis possession and market transactions constitute illegal activities in most jurisdictions, there is a strong incentive for cannabis users to conceal their behaviour, for fear of punishment. The concealment of cannabis use can also result from embarrassment or social disapproval (Swadi, 1990). Such misreporting can have a significant impact on the quality of research findings. A major focus of this paper therefore is to examine the profile of those people who misreport their cannabis consumption.

The data
The data we use for the model are drawn from the Australian National Drug Strategy Household Survey, which is a nationally representative survey of the noninstitutionalised Australian civilian population aged over 14 providing information on drug use patterns, attitudes, and behaviour (NDSHS, 2010). A multi-stage, stratified area sample design ensured a random sample of households in each geographical stratum. As mentioned above, there has been some discussion in the existing literature regarding the potential for misreporting to be influenced by how the survey is conducted. The earlier waves of the National Drug Strategy Household Survey used face-to-face and drop-and-collect methods to collect data. The computer-assisted telephone interview (CATI) method of data collection was introduced in the 2001 survey. In that particular survey, all three methods were employed to collect data. The 2004 and 2007 surveys, on the other hand, were administered using only the drop-and-collect and CATI methods, whereas the more recent surveys were conducted only using the CATI method. We restrict our study to the 2001, 2004, and 2007 surveys for the following reasons. The older surveys included inconsistent questions with regard to the key variable of interest, whereas the more recent surveys were conducted using only the CATI method. As discussed later in the paper, variation in the method of collection is key to identifying the misreporting equation. Definitions of all variables used in the study are given in the Appendix. A sample of 50,153 individuals is thus available for estimation. This data set has been used in several previous studies (see, e.g., Cameron & Williams, 2001; Harris & Zhao, 2007; Williams, 2004; Zhao & Harris, 2004).
In this data set, neither the monetary expenditures on nor the physical quantities of cannabis consumed are reported. The information on individuals' consumption of cannabis is given via a discrete variable measuring the participation and intensity of consumption in the last 12 months. In particular, the information in the data concerning an individual's consumption of cannabis is collected through the question "Have you used cannabis/marijuana in the last 12 months" and "In the last 12 months, how often did you use cannabis/marijuana?", where the responses to the frequency of use take the form of one of the following choices: not at all (y = 0); using cannabis once or twice a year (y = 1); using cannabis monthly or every few months (y = 2); and using cannabis everyday or once a week (y = 3).
In terms of explanatory variables, we have three blocks: x r , to determine participation; x m , for misreporting; and x y , to determine consumption levels. Although many of the variables overlap (as we have no a priori information as to where they should appear in the model and where not), to facilitate identification, we apply some natural exclusion restrictions. The common variables in the three equations include a wide range of personal and demographic characteristics, namely, gender; marital status; individual's (standardised) age; a dummy variable for whether there are preschool children in the household; whether the individual comes from a single parent household; a dummy variable for whether the individual resides in a capital city; and a dummy variable for whether the individual is of Aboriginal or Torres Strait Islander origin.
We also control for educational attainment, distinguishing between four categories of highest educational attainment: a tertiary degree; a nontertiary diploma or trade certificate; year 12 education; and less than year 12 education, which is the omitted category. Illicit drugs are just market commodities, and users are just market participants. In terms of the individual's economic situation, we control for the household annual income before tax measured in Australian dollars using eight income bands as described in the Appendix, with the highest band being the omitted category. Although income may act as a social class proxy in the participation and misreporting decisions, the amount of consumption is likely to be directly related to the level of income as it is with any normal good. We also use individual's main labour market status to control for their economic situation, that is, employed, studying, unemployed, and other activities such as retired, on a pension or performing home duties, which form the omitted category.
The criminal justice environment is an important determinant of drug participation and consumption. At the same time, it also increases the incentive to misreport. For instance, the fear of punishment may be heightened if users perceive that supplying accurate information could lead to legal repercussions. Australia has long-standing laws with regard to cannabis decriminalisation. South Australia was the first jurisdiction to implement an expiation system for minor cannabis offences in 1987. Under this scheme, simple cannabis offences such as possessing or cultivating small amounts for personal use are subject to minor penalties, although the sanctions for commercial dealings are rather significant. Similar expiation systems have since been introduced in other Australian states and territories, and yet, others have been gradually easing their laws against cannabis consumption in recent years. We therefore include in all three equations a variable to represent the decriminalisation status of cannabis use across the various Australian states and territories. We also control for any migration effect by using an indicator for whether the individual has migrated to Australia in the last 10 years. Any time trend in participation and levels of consumption is addressed using time indicators for the surveys.
Although the inherent nonlinearity in our model can help achieve identification, we impose exclusion restrictions to ensure that the model is identified on data. We therefore include additional explanatory variables in the participation equation, which we believe do not directly influence misreporting behaviour. In particular, drug culture or peer drug use has been identified as an important risk factor for drug participation (see, e.g., Delaney, Harmon, & Wall, 2008; Kenkel, Reed III, & Wang, 2002; Pudney, 2004). We therefore include the variable "peer" in $x_r$ that represents the proportion of the individual's friends and acquaintances that use cannabis. This variable is excluded from the misreporting equation. Given evidence on the gateway effect of alcohol to harder drugs such as cannabis (Pacula, 1998) and the association of body piercing and tattoo procedures with risk-taking behaviours (see, e.g., Deschesnes, Finès, & Demers, 2006; Heywood et al., 2012), we also include in $x_r$ dummy variables indicating whether an individual started drinking alcohol at a young age (i.e., below the legal age of 18 years), and whether the individual has ever undergone a body piercing procedure or a tattoo procedure. These risk indicators are not expected to influence individuals' misreporting behaviour. We also include year dummies in $x_r$ and $x_y$ to represent any trend changes over time. Finally, an individual's attitude towards drug laws is very likely to influence his or her consumption. We thus include a dummy variable in $x_r$ and $x_y$, which takes the value 1 if the individual believes that a small quantity of cannabis for personal use should be a criminal offence, and 0 otherwise. These regressors are also excluded from $x_m$.
We also use some exclusion restrictions to help identify the misreporting equation by including some variables exclusively in $x_m$. Following the previous literature, these mostly relate to the conditions under which the survey was administered, and therefore may potentially influence the extent to which individuals misreport, but not their participation or consumption levels. Specifically, we control (using indicators) for the following: if anyone else was present when the respondent was completing the survey questionnaire ("present"); if anyone helped the respondent complete the survey questionnaire ("help"); and the survey format ("survtype"; which takes a value 0 if the drop-and-collect method was used and takes a value 1 if the CATI or face-to-face method was used). These variables conform with the factors that have been associated with misreporting or misclassification in prior studies (see, e.g., Berg & Lien, 2006; Hoyt & Chaloupka, 1994; Kraus & Augustin, 2001; Lu, Taylor, & Riley, 2001; Mensch & Kandel, 1988; O'Muircheartaigh & Campanelli, 1998), although none of these studies has modelled misreporting explicitly. We also include as an instrument a variable indicating a general lack of trust in the survey, which we proxy by the percentage of compulsory questions left unanswered in the survey. Note that nonresponse rates in general with regard to the response variable were very low (under 2%), such that this is not likely to adversely affect our approach and/or findings.
Finally, in terms of consumption levels, a standard consumer demand framework applies with special characteristics for addictive goods (see, e.g., Becker & Murphy, 1988). We thus include standard demand-schedule own and cross-drug prices in $x_y$. Other than cannabis price, we control for the price of a range of related drugs such as amphetamines, cocaine, heroin, alcohol, and tobacco, in light of the evidence that certain drugs act as either complements or substitutes to cannabis (see, e.g., Cameron & Williams, 2001; Ramful & Zhao, 2009; Zhao & Harris, 2004).
Price series for cannabis, cocaine, amphetamines, and heroin are obtained from the Illicit Drug Reporting System (IDRS) (NDARC, 2009). They vary across Australian states and territories and by year. The IDRS collects such data predominantly from interviewing injecting drug users and key informants who have regular contact with illicit drug users but which may potentially exhibit coverage error (NDARC, 2009). In occasional cases where a price report is missing, it is constructed using information from the Australian Bureau of Criminal Intelligence, which was replaced by the Australian Crime Commission in recent years. The Australian Bureau of Criminal Intelligence/Australian Crime Commission is an alternative source for drug prices, which collects information on drugs through covert police units and police informants (ACC, 2010).
The advantage of using price data from the IDRS is that they are provided with unified measures and fewer missing observations. To be specific, the price of cannabis is measured in dollars per ounce, and the respective prices of amphetamines, cocaine, and heroin are measured in dollars per gram. The data on alcohol and tobacco prices are obtained in the form of indices from the Australian Bureau of Statistics (ABS, 2010). All price and income series are deflated using the all-items Consumer Price Index (CPI) for individuals' respective states of residence. Depending on the particular price series, there is clearly a potential for measurement error here (especially with regard to the illegal drugs).

Table 1 presents some summary statistics on the observed cannabis consumption. On average, around 89% of individuals identify themselves as current nonusers. Given the way the survey questions are asked, these self-identified nonusers, or the build-up of zero observations, will include genuine nonusers, recent quitters, infrequent users who are not currently consuming cannabis, and potential users who might use when, say, the price falls. More importantly, these self-identified nonusers may include misreporters who, out of embarrassment, social disapproval, and/or fear of repercussions, may prefer to identify themselves as nonusers. Given (a) that users have incentives to misreport consumption and (b) that for users who report truthfully, the choices of consumption intensities are ordered, this presents a good case for the DZIOP(C) model(s) in order to identify the different types of zero observations and their potentially differing driving factors.

Note that there is also the possibility of over-reporting, particularly with regard to the intensity of consumption (possibly due to memory issues). However, there is evidence that over-reporting is rarely a problem when analysing self-reported drug use (see, e.g., Swadi, 1990, and references therein).
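To illustrate how a fitted model of this kind could separate the reported zeros by type, the sketch below splits the probability of a zero into its three sources and expresses each as a share of the total. It assumes the independent-errors version of the model; the function names and index values are hypothetical, and the resulting shares are not the paper's estimates.

```python
import math

def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def zero_shares(xb_r, xb_m, xb_y):
    """Shares of Pr(y = 0) attributable to each source of zeros.

    xb_r, xb_m, xb_y are the linear indices of the participation,
    reporting and consumption equations (independent errors assumed).
    """
    p_r, p_m = Phi(xb_r), Phi(xb_m)
    nonparticipant = 1.0 - p_r                 # r = 0
    misreporter = p_r * (1.0 - p_m)            # r = 1, m = 0
    infrequent = p_r * p_m * Phi(-xb_y)        # r = 1, m = 1, y_tilde = 0
    total = nonparticipant + misreporter + infrequent
    return {
        "nonparticipant": nonparticipant / total,
        "misreporter": misreporter / total,
        "infrequent": infrequent / total,
    }

# Hypothetical index values for a single individual
shares = zero_shares(xb_r=0.0, xb_m=0.5, xb_y=0.2)
```

Averaging such shares over the individuals who report zero would give a sample-level decomposition of the kind summarised in the abstract.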
The results
Table 2 reports the estimated coefficients of the DZIOPC model. In particular, we report three sets of results corresponding to the three equations: participation, truthful reporting, and levels of consumption. Note that out of three correlation coefficients, only $\rho_{12}$ (i.e., the correlation between the participation and misreporting equations) is strongly statistically significant (although the joint Likelihood Ratio test only marginally fails at 10%). Turning firstly to the results relating to participation, we find that, consistent with existing evidence, increasing age, being married, living in a capital city, and being a new migrant decrease the probability of participation. On the other hand, being male, having started drinking at a young age, having a tattoo or body piercing, and being of Aboriginal or Torres Strait Islander background are associated with higher participation (see, e.g., Cameron & Williams, 2001; Deschesnes, Finès, & Demers, 2006; Ramful & Zhao, 2009; Saffer & Chaloupka, 1999). In terms of education, we find those with higher qualifications are more likely to report participation. However, we do not find evidence of labour market status or income effect on participation. Consistent with the literature, cannabis use among peers and the decriminalisation laws also have a positive impact on participation (see, e.g., Cameron & Williams, 2001; Delaney, Harmon, & Wall, 2008; Farrelly, Bray, Zarkin, & Wendling, 2001; Kenkel, Reed III, & Wang, 2002; Pudney, 2004; Saffer & Chaloupka, 1999), although support for tighter drug laws is negatively related to participation. Focusing on the misreporting equation, we find that age, being male, and living in a capital city are associated with a higher probability of truthful reporting, whereas individuals from a single-parent household and of Aboriginal status are more likely to misreport. Interestingly, we find higher levels of education to be also associated with a higher probability of misreporting. Although we might expect decriminalisation to increase honest reporting in light of reduced legal implications, it is nevertheless associated with higher probability of misreporting (presumably this variable is capturing other state/time effects not controlled for elsewhere).

Note (Table 2). $\rho_{a,b}$ with $a, b \in (r, m, \tilde{y})$ correspond to the correlation coefficients across the participation ($r$), misreporting ($m$), and level of consumption ($\tilde{y}$) equations, respectively. Robust standard errors are given in parentheses. * significant at 10% level; ** significant at 5% level.
In terms of the instruments in the misreporting equation, all four of them are statistically significant and negative. The results suggest that the presence of another person during the completion of the survey, or the provision of assistance with it, increases the probability of misreporting. Similarly, the CATI and face-to-face methods of interview (relative to drop-and-collect) also increase the probability of misreporting. Finally, individuals who demonstrated a lack of trust in the survey by, in general, refusing to give a response also had a higher probability of misreporting.
With respect to the levels of consumption, we find that being male, being of Aboriginal descent, and being unemployed all have statistically significant positive effects. Similarly, having started drinking at a young age, having a tattoo or body piercing, and peer drug use are also positively related to levels of drug consumption. According to the rational addiction model of Becker and Murphy (1988), drug users are rational, forward-looking utility maximizers who base consumption decisions on full knowledge of the consequences of addiction. Current consumption by a young adult raises the user's marginal utility of future use but also reduces overall utility in the future, given that the rational user takes account of the addictive properties of drugs and their implications for future health and wealth. We thus allow for this nonlinear age-consumption relationship through a quadratic specification for age. 6 Our results indeed show evidence of an inverted U-shaped relationship between levels of consumption and age. In other words, at both ends of the age distribution, individuals are associated with lower levels of consumption. Having young children in the household, being employed or a full-time student, and having higher qualifications are all associated with lower levels of consumption. Although we do not find evidence of any significant impact of household income on participation or misreporting, we observe a general decline in the levels of use in the highest income groups.
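As a brief illustration of the quadratic age specification, suppose (hypothetically, with symbols introduced here rather than taken from the paper) that age enters the latent consumption index as β1·age + β2·age². An inverted U-shape in that index then corresponds to β1 > 0 and β2 < 0, with the peak located where the derivative with respect to age is zero:

```latex
\frac{\partial}{\partial\,\mathrm{age}}\left(\beta_1\,\mathrm{age} + \beta_2\,\mathrm{age}^2\right)
  = \beta_1 + 2\beta_2\,\mathrm{age} = 0
\quad\Longrightarrow\quad
\mathrm{age}^{*} = -\frac{\beta_1}{2\beta_2},
\qquad \beta_1 > 0,\ \beta_2 < 0 .
```

This describes the latent index only; the turning point in the overall outcome probabilities also depends on how age enters the participation and misreporting equations.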
Considering the price variables in the consumption equation (which act as identifying variables here), we find that the price effect of cannabis is positive and significant. It is important to note that the price of cannabis is strongly associated with quality (see, e.g., Cameron & Williams, 2001;Williams, 2004), and because we are unable to control for the price variation due to quality, a positive price effect could well be picking up the drug quality effect on participation. This counter-intuitive price effect is, however, also found for several competing models (such as the generalised ordered probit [GOP] and correlated ZIOP [ZIOPC]; see Table 3 and the following section) and is therefore not an adverse finding of the current approach per se. The level of cannabis consumption is also responsive to heroin and speed prices, suggesting that cannabis is an economic substitute for heroin but a complement to speed. However, the price effects of cocaine, tobacco, and alcohol are all statistically insignificant. In summary, with at least two price variables exhibiting high levels of significance, along with similarly strong identifying instruments in the participation and misreporting equations, we are confident overall in our model results.

Partial effects
As with any probability model, partial effects are generally more informative than coefficients. There are several sets of partial effects that may be estimated here. For example, one may be interested in the partial effects of an explanatory variable on probabilities such as the probability of participation, Pr(r = 1), the probability of misreporting, Pr(m = 0), the probabilities for the levels of consumption conditional on participation, Pr(ỹ = j|r = 1), and the overall probabilities for different levels of consumption, Pr(y = j).
In particular, we are interested in the probability of reporting zeros, as this forms the major contribution of our approach. The partial effect on the overall probability of observing zero consumption, Pr(y = 0), is the sum of the effects on the probabilities of the three types of zeros; that is, the probability of nonparticipation, the probability of misreporting, and the probability of zero-consumption arising from participants who are infrequent or potential consumers. Note that the explanatory variables of interest may appear in only one or two of x r , x m , and x y , or in all three. For comparison purposes, in Table 3, we also present results from a GOP model without any hurdles, where the boundary parameters are specified as a function of variables in x m that do not appear in x y . Standard errors of the partial effects for all models are obtained using the Delta method (Greene, 2008), and the effects themselves were computed numerically.
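The decomposition of the partial effect on Pr(y = 0) can be sketched numerically. The following is a minimal illustration only, not the authors' implementation: it assumes the uncorrelated (independent errors) case, a single scalar covariate x shared by all three equations, m = 1 denoting truthful reporting, the usual ordered probit form Φ(μ0 − x·b_y) for the lowest consumption category, and entirely hypothetical parameter values and covariance matrix. Partial effects are obtained by central finite differences and the standard error by a numerical Delta method.

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def zero_components(x, theta):
    """Return (Pr(r=0), Pr(r=1, m=0), Pr(r=1, m=1, y~=0)) at covariate value x.

    Assumes independent errors and one shared scalar covariate x.
    theta = (a_r, b_r, a_m, b_m, b_y, mu0) is a hypothetical parameter vector.
    """
    a_r, b_r, a_m, b_m, b_y, mu0 = theta
    p_participate = Phi(a_r + b_r * x)   # Pr(r = 1)
    p_truthful = Phi(a_m + b_m * x)      # Pr(m = 1)
    p_lowest = Phi(mu0 - b_y * x)        # Pr(y~ = 0)
    return (
        1.0 - p_participate,
        p_participate * (1.0 - p_truthful),
        p_participate * p_truthful * p_lowest,
    )

def partial_effects(x, theta, h=1e-4):
    """Central finite-difference partial effects of x on each zero component."""
    up = zero_components(x + h, theta)
    down = zero_components(x - h, theta)
    return tuple((u - d) / (2.0 * h) for u, d in zip(up, down))

def delta_method_se(x, theta, cov, h=1e-4):
    """Delta-method standard error of the total partial effect on Pr(y = 0)."""
    def total_effect(params):
        return sum(partial_effects(x, params))

    # Numerical gradient of the total partial effect with respect to theta.
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += h
        down[k] -= h
        grad.append((total_effect(up) - total_effect(down)) / (2.0 * h))

    n = len(theta)
    variance = sum(grad[i] * cov[i][j] * grad[j]
                   for i in range(n) for j in range(n))
    return math.sqrt(variance)

# Hypothetical parameter values and a hypothetical diagonal covariance matrix.
theta = (-0.9, 0.3, 0.8, -0.2, 0.4, 0.5)
cov = [[0.01 if i == j else 0.0 for j in range(6)] for i in range(6)]

effects = partial_effects(0.0, theta)
print("effects on the three zero types:", effects)
print("total effect on Pr(y = 0):", sum(effects))
print("delta-method s.e.:", delta_method_se(0.0, theta, cov))
```

The three component effects can have different signs, so the total effect on Pr(y = 0) may be small even when the individual components are not.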
We report the partial effects on Pr(y = 0) (estimated at sample means) coming from these three sources in the correlated DZIOPC model in Table 3. For further comparison, we also set these results against partial effects estimated from a ZIOPC model that allows zero observations to come from two distinct sources (nonparticipation, and infrequent consumption/misreporting) and, as noted, from a GOP model that does not explicitly model zero observations coming from different sources but allows the boundary parameters inherent in the OP model to be a function of the zero-generating variables. 7 For the DZIOPC model, the overall partial effects are decomposed into three parts: nonparticipation, Pr(r = 0), with participation clearly being the mirror image of this; participation and misreporting, Pr(r = 1, m = 0); and participation, truthful reporting, and zero consumption, Pr(r = 1, m = 1, ỹ = 0).
Interestingly, we observe some important differences across the estimates from the alternative models for some explanatory variables, such as living in a capital city, household income, and education. A key example is the effect of education. The ZIOPC model indicates that those with higher qualifications have a lower probability of nonparticipation but a higher probability of participation with infrequent consumption. With the additional misreporting dimension in the DZIOPC model, we find that those with higher qualifications also have a higher probability of misreporting. For instance, from the ZIOPC results, relative to those with less than year 12 qualifications, degree holders have a 2.0 percentage point (pp) lower probability of being a nonparticipant and a 1.6 pp higher probability of being a participant with zero consumption, resulting in an overall 0.4 pp lower probability of observing zero consumption. The DZIOPC results, in contrast, indicate that degree holders have a 2.6 pp lower probability of being a nonparticipant, a 1.1 pp higher probability of being a participant with zero consumption, and a 0.6 pp higher probability of being a misreporter. Overall, the probability of observing zero consumption is 0.9 pp lower for degree holders than for those with less than year 12 qualifications. Finally, basing policy advice on the GOP model results, one would conclude that education has no impact on cannabis consumption, presumably because the opposing effects cancel out.
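The education example can be checked as simple arithmetic on the components. The values below are in percentage points and are taken from the text; the Δ notation is introduced here only for illustration.

```latex
\begin{aligned}
\text{ZIOPC:}\quad  & \Delta\Pr(y=0) = \underbrace{-2.0}_{\text{nonparticipation}} + \underbrace{1.6}_{\text{zero consumption}} = -0.4,\\[4pt]
\text{DZIOPC:}\quad & \Delta\Pr(y=0) = \underbrace{-2.6}_{\text{nonparticipation}} + \underbrace{0.6}_{\text{misreporting}} + \underbrace{1.1}_{\text{zero consumption}} = -0.9 .
\end{aligned}
```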
The effect of decriminalisation also highlights the potential of the DZIOPC model. For instance, the GOP model indicates that decriminalisation does not affect consumption. From the ZIOPC model, we find that decriminalisation is associated with a higher probability of participation, whereas the DZIOPC model suggests that an easing of the criminal justice system is actually associated with both a higher probability of participation and a higher probability of misreporting.
In short, comparison with models that ignore potential misreporting effects suggests that such models yield biased estimates of various quantities of interest and potentially erroneous policy advice.

Robustness checks
There may be concerns that certain control variables are endogenous and/or that some of the instruments we use to identify the model in various equations have effects elsewhere in the model. 8 In this section, we perform robustness checks to test whether our results change significantly with the exclusion of these so-called problem variables and under alternative specifications. We compare four different specifications with our main one: Spec 1, in which price variables enter both the participation and the consumption equations; Spec 2, the main specification without tattoo and body piercing; Spec 3, the main specification without peer influence; and Spec 4, the main specification with only survey type and trust as identifying variables in the misreporting equation.
Comparing across the resulting partial effects, we find that the results are generally robust to the various specifications (Table 4). For example, the partial effect of male is −0.025 from the main model. This effect varies between −0.023 and −0.030 across the four alternative specifications. Similarly, the partial effect of having a degree varies in the range −0.004 to −0.009, comparable with the estimated partial effect of −0.009 from the main model. With regard to the results for the respective misreporting equations, the results essentially remained unchanged (and are available on request).
Different specifications for the misreporting equation were also experimented with; for example, it could be argued that the peer influence and support for criminalisation variables also affect the misreporting process. These results are not presented here (but are available on request); again, they did not significantly affect the overall results. Although we argue above that a positive effect of own-price on consumption levels might be picking up quality effects, it could also be argued that there is potential for reverse causation here. The same arguments could be made about the findings with regard to decriminalisation. In light of this, we also experimented with replacing these variables with both year and state dummies. Although not presented here, once more the general results were not unduly affected.
We note here that there is also information available in the data regarding lifetime cannabis use and use in the last month. One would expect misreporting levels to be lower for the former and higher for the latter (Brown, Harris, Srivastava, & Zhang, 2017). Unfortunately, however, these are simple zero/one responses, so they do not fit into the framework proposed in this paper.

Note.
Robust standard errors are given in parentheses. * significant at 10% level; ** significant at 5% level. The four specifications are similar to the main one except for the following: Spec 1, price variables appear in both participation and consumption equations; Spec 2, tattoo and body piercing are dropped from the main specification; Spec 3, peer influence is dropped from the main specification; and Spec 4, only survey type and trust are used as instruments in the misreporting equation.
Finally, although we do not report the results for reasons of space, we also conducted a series of Monte Carlo experiments based on the empirical data and specification used in the paper. In short, the model performed extremely well (with regard to estimating all key quantities of interest) under a range of scenarios and did not suggest any identification issues. A strong finding was that it appeared most important to ensure identifying variables in the misreporting equation. Another very strong finding was that the restricted submodels (OP, GOP, ZIOP) yielded heavily biased estimates of some key quantities of interest.

Predicted probabilities
A key output from such a model relates to summary predicted probabilities, especially with regard to the zeros. There are several predicted probabilities that will be of interest with the DZIOP class of models. For example, one may be interested in the marginal probability of participation, Pr(r = 1). In terms of misreporting, one may be interested in the marginal probability of misreporting, Pr(m = 0); the joint probability of participation and misreporting, Pr(r = 1, m = 0); or the probability of truthful reporting conditional on participation, Pr(m = 1|r = 1). Similarly, there is a range of probabilities one may be interested in when predicting levels of consumption. However, our main interest in this paper is in the misreporting dimension. Therefore, to gain insight into the sources of the observed zeros, we present in the first row of Table 5 the predicted probability (averaged across all individuals) of the zeros and its three respective components (using the equations presented above): nonparticipation, misreporting, and zero consumption. We find that the overall predicted probability of observed zero consumption in the population, 88.8%, is made up of probabilities of 81.5% for nonparticipation, 4.4% for misreporting, and 2.9% for infrequent consumption.
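The decomposition of the averaged predicted probability of a zero can be written out as a quick consistency check, using the figures quoted in the text (the component labels follow the notation used for the three types of zeros):

```latex
\underbrace{0.888}_{\Pr(y=0)}
  \;=\; \underbrace{0.815}_{\Pr(r=0)}
  \;+\; \underbrace{0.044}_{\Pr(r=1,\,m=0)}
  \;+\; \underbrace{0.029}_{\Pr(r=1,\,m=1,\,\tilde{y}=0)}
```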
Such probabilities can be thought of as prior probabilities. That is, they apply to a randomly selected individual from the population, about whom we know nothing except for their characteristics. However, to provide further insights into the extent of misreporting, it is possible to estimate posterior probabilities, analogous to those considered in latent class models (Greene, 2008), that are conditional on the outcome chosen by the individual. This specifically allows us to predict what percentage of these zeros come from nonparticipation, misreporting, and zero consumption, respectively, using all the information we have on the individual. It therefore attempts to answer the following question: given that an individual recorded a zero, what is the probability that he/she is a true nonparticipant, a misreporting participant, or an infrequent consumer (given their observed characteristics)? The posterior probabilities for the three types of zeros are given as (Greene, 2008):

$$\Pr(r = 0 \mid y = 0, x) = \frac{f(r = 0 \mid x)}{f(y = 0 \mid x)}, \tag{13}$$

$$\Pr(r = 1, m = 0 \mid y = 0, x) = \frac{f(r = 1, m = 0 \mid x)}{f(y = 0 \mid x)}, \tag{14}$$

$$\Pr(r = 1, m = 1, \tilde{y} = 0 \mid y = 0, x) = \frac{f(r = 1, m = 1, \tilde{y} = 0 \mid x)}{f(y = 0 \mid x)}. \tag{15}$$

Note that we have used the uncorrelated DZIOP model in the above. Note also that the above probabilities are all conditional on y = 0. As a result, the numerator on the right-hand side of all three equations should be a joint probability with y = 0. For example, the numerator in Equation 13 is strictly f(r = 0, y = 0|x). However, this can be simplified as above given that y = 0 when r = 0. The same applies in the other two equations given that y = 0 when m = 0 or when ỹ = 0.
From Table 5, we find that about 72% of the reported zeros come from genuine nonparticipation, 17% from those who have misreported their participation, and 11% from zero consumption participants (estimated individually and averaged across). Moreover, the small estimated standard errors on these quantities are an indication that they have been estimated relatively accurately.
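The posterior calculation can be sketched as follows. This is an illustrative example only: the function name and the three individuals' prior component probabilities are hypothetical and are not taken from the survey data or from Table 5. Each posterior is the corresponding prior component divided by that individual's Pr(y = 0|x), and the individual posteriors are then averaged.

```python
def posterior_zero_shares(p_nonpart, p_misreport, p_infrequent):
    """Posterior probabilities of the three zero types, given y = 0.

    Inputs are one individual's prior components:
    Pr(r=0|x), Pr(r=1, m=0|x), and Pr(r=1, m=1, y~=0|x).
    """
    p_zero = p_nonpart + p_misreport + p_infrequent  # Pr(y = 0 | x)
    return (p_nonpart / p_zero, p_misreport / p_zero, p_infrequent / p_zero)

# Hypothetical prior components for three individuals (illustration only).
individuals = [
    (0.95, 0.02, 0.01),
    (0.60, 0.15, 0.10),
    (0.30, 0.25, 0.20),
]

# Posterior shares computed individually, then averaged across individuals.
posteriors = [posterior_zero_shares(*p) for p in individuals]
avg_posterior = [sum(col) / len(col) for col in zip(*posteriors)]

# For contrast: shares implied by first averaging the prior components.
avg_prior = [sum(col) / len(col) for col in zip(*individuals)]
shares_from_avg_prior = posterior_zero_shares(*avg_prior)

print("average of individual posteriors:", avg_posterior)
print("shares from averaged priors:     ", shares_from_avg_prior)
```

In general, averaging the individual posteriors does not give the same numbers as taking ratios of the averaged prior components, so the two sets of summary figures need not coincide.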
These are important findings suggesting that misreporting and infrequent users reporting zero consumption in survey data with a fixed time frame (12 months here) may lead to considerable underestimation of drug use prevalence.

Conclusions
When modelling "social bads", such as illegal drug consumption, researchers are often faced with a dependent variable characterised by a large number of zero observations. Such zero observations could result from individuals misreporting activities that are regarded as socially undesirable, are illegal, or are associated with perceived social stigma, as is the case with drug consumption. The accuracy of the information gathered from surveys is therefore crucially dependent on respondents providing reliable and accurate responses. If they do not, important behaviours will be misclassified, thereby masking their true incidence. Thus, if ignored, misreporting potentially leads to inaccurate estimates of the prevalence of such behaviours and ultimately may lead one to question the validity of any conclusions drawn on the basis of these, which in turn raises concerns as to how useful such data actually are to policymakers.
Building on the recent literature on hurdle and double-hurdle models, we propose a double-inflated modelling framework, where the zero observations are allowed to come from the following: nonparticipants; participant misreporters (who have larger loss functions associated with a truthful response); and infrequent consumers. Due to our empirical application, the model is derived for the case of an ordered discrete dependent variable. However, it is similarly possible to augment other such zero-inflated models (e.g., zero-inflated count models, and double-hurdle models for continuous variables). The model is then applied to a consumer choice problem of cannabis consumption.
Overall, the results suggest that misreporting has a significant effect on the incidence of cannabis use. Specifically, the posterior probability of misreporting cannabis participation is estimated at 0.17. In other words, out of all the reported zeros, 17% can be attributed to misreporting. The model also predicts that 11% of the reported zeros for cannabis use are from infrequent users with zero consumption ("corner solution" individuals), and only 72% are from true nonparticipants. Compared with more standard modelling techniques, the modelling framework also provides important insights into the drivers of misreporting in surveys. For example, a zero effect in a simple model might well disguise significant composite effects that simply net each other out. Interestingly, the findings also suggest that the extent of misreporting is influenced by how the survey was administered, as well as by factors such as the presence of other individuals when the survey was completed and the individual's general trust in such surveys. In order to enhance the accuracy of information gathered from surveys, it is therefore important to pay attention to the conditions under which the survey data are collected. The findings suggest that accounting for misreporting is important when using survey data related to sensitive activities, especially where such data are used to inform public policy.