Appendix for online publication

Table A.1 expands the balance table in the main paper for the full set of baseline covariates available and used in the treatment effects regressions. Column 1 reports the sample mean for each covariate, and Columns 2 to 7 report the coefficients and p values on treatment indicators from ordinary least squares (OLS) regressions of each baseline covariate on three treatment indicators (one for assignment to each treatment arm), controlling for block fixed effects. Column 8 reports the p value from a joint test of significance of the three coefficients. Finally, at the base of the table we report the p value from a test of joint significance of all covariates from an OLS regression of each treatment indicator on all covariates (using that treatment group and the control group alone).
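The per-covariate balance check described above can be sketched as follows. This is a minimal illustration on synthetic data, not our actual estimation code; the variable names, sample sizes, and homoskedastic standard errors are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_blocks = 400, 8
block = rng.integers(0, n_blocks, size=n)   # randomization block
arm = rng.integers(0, 4, size=n)            # 0 = control, 1-3 = treatment arms
covariate = rng.normal(size=n)              # one baseline covariate

# Regress the covariate on three treatment indicators plus block fixed effects
B = np.eye(n_blocks)[block]                 # block dummies (absorb the intercept)
T = np.eye(4)[arm][:, 1:]                   # indicators for the three arms
X = np.hstack([B, T])

beta, *_ = np.linalg.lstsq(X, covariate, rcond=None)
resid = covariate - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
V = sigma2 * np.linalg.inv(X.T @ X)         # homoskedastic covariance matrix

# Joint test that all three treatment coefficients are zero (as in Column 8)
R = np.zeros((3, X.shape[1]))
R[:, -3:] = np.eye(3)
Rb = R @ beta
F = (Rb @ np.linalg.solve(R @ V @ R.T, Rb)) / 3   # compare to an F(3, dof) critical value
```

Under successful randomization, the treatment coefficients should be small and the joint F statistic should rarely exceed conventional critical values.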

Of 171 coefficients (57 covariates and 3 treatment arms), 14 (8.2%) have a p < .05, and 28 (16%) have a p < .1. Within treatment arms the covariates are not jointly significant, as seen from the joint test reported at the base of the table. Furthermore, 12 (21.1%) of the tests of joint significance have a p < .1. Table A.2 repeats the same balance analysis for the 947 subjects interviewed at endline. Of 171 coefficients (57 covariates and 3 treatment arms), 13 (7.6%) have a p < .05, and 20 (11.6%) have a p < .1. Within treatment arms the covariates are not jointly significant, as seen from the joint test reported at the base of the table. Furthermore, 12 (21.1%) of the tests of joint significance have a p < .1.
Overall, therefore, there is minor imbalance. We control for all baseline covariates in all treatment effects regressions in the paper to account for this. Table A.3 describes each of the study neighborhoods where we recruited, along with population estimates. We report estimates of the number of all adult males, as well as our low-end estimates of the number of target males in each neighborhood: men aged 18 to 35 in the bottom decile of income.

A.3 Tracking and attrition
We achieved tracking rates of roughly 93% over a year. Given that this was such a transient population, we took special measures to minimize attrition. (We maintained the Phase 1 baseline survey for all phases for the sake of consistency and completeness.) Rates of 80, 90, or even 95 percent are not uncommon in developing-country field experiments and panel surveys. For example, the Indonesia Family Life Survey reached 94% of households and 91% of target individuals after four years. The Kenyan Life Panel Survey made contact with 84 percent of target respondents over a seven-year period. Similarly, in the US, researchers were able to reach 98% of the Perry Preschool children at age 19 and 95% at age 27. One reason is that a small sample is easier to track intensively. Another is that enumerator wages are lower in Liberia than in the US, which makes intensive sleuthing and tracking affordable.
Tracking to reduce attrition At baseline we were clear about our desire to stay in touch. We took photos and signature samples, and collected as many as ten different ways to contact each respondent. We documented contact information for each respondent, including all the places they said they sometimes stay, plus contact information for the network of people around them who have a more stable location. Respondents were often on the run from the police or other people, and so their contacts might be uncomfortable speaking to enumerators and revealing the respondent's location. Thus, after the baseline survey, we asked respondents to use the enumerator's phone to call their most stable contact, introduce the enumerator and the study, and give the contact permission to speak with enumerators in the future.
At each endline, enumerators would typically start with the phone numbers of the various contacts or the respondent and try to arrange an appointment. (Contacts received no financial incentive.) Failing that, they would begin visiting the various locations listed. A slight majority of respondents were found within a few hours. In other cases, all leads were cold, and more extensive sleuthing and asking around the neighborhood was required. If someone had traveled or moved far away, enumerators either waited until they returned or traveled across the country to find them in person.
On the upper tail, it could take three to four days of physical searching to find the hardest-to-locate people. Enumerators only stopped searching when all possible leads had been exhausted.
Response rates Table A.4 lists survey response rates by treatment group and survey wave (pooling the 2- and 5-week surveys, and pooling the 12- and 13-month surveys). It also reports the p-value from a t-test of the difference between the response rate in each treatment group and the control group. None of the differences are statistically significant, and all are within about a percentage point of the control group response rate. The control group response rate is slightly lower in the 12-13-month surveys and slightly higher in the short-run ones. But none of these differences control for covariates or even strata fixed effects, as in the next table.
Correlates of attrition and compliance We analyze the correlates of attrition in Columns 1 and 2 of Table A.5, which reports an OLS regression of an indicator for attrition on selected baseline covariates. There are no significant differences in attrition by treatment group, either substantively or statistically. Those who attrit are slightly wealthier and have slightly poorer mental health. In all, the treatment indicators and covariates are jointly significant at p = 0.047, so attrition is not ignorable. This is one reason we control for covariates in all treatment effects regressions.
A.4 Treatment compliance
Figure A.1 displays the distribution of class attendance for those assigned to therapy. NEPI did not collect attendance data during the first week (three sessions), so for simplicity we assume that all participants who attended at least one session after week one also attended the first three sessions.
We use two definitions of compliance. Our first measure is defined as "attending at least 8 days of therapy", or about three of the eight weeks. Our second measure is defined as attending at least 80% of sessions (16 classes plus the 3 in the first week).
We analyze the correlates of compliance in Columns 3 through 8 of Table A.5. Being assigned to cash in addition to therapy did not affect the likelihood of attending therapy, which is to be expected since the cash grants were not known to participants until after therapy. The main correlates of compliance in the first three weeks are higher education, higher initial antisocial behaviors, and higher self-control skills. The main correlates of attending at least 80% of the sessions are higher education, better mental health, and patience in game play; higher initial antisocial behaviors and higher self-control skills are no longer so relevant.
Notes to Table A.4: Survey response rates are calculated as the difference between the total number of respondents at baseline and the number of respondents "unfound" at each endline, all divided by the number of respondents at baseline. Here, "unfound" refers to both respondents we could not locate and those we did locate but who chose not to participate in the survey.

B.1 Power calculations
After completing the pilot, we decided on a target sample of 1,000. This target was based on maximum program capacity and financial constraints. Based on the pilot, we estimated that the minimum detectable effect for the full 1,000 (with a quarter of the sample in each treatment) would be a 0.12 standard deviation change in a standardized dependent variable, for a two-tailed hypothesis test with statistical significance of 0.05, statistical power of 0.80, an intra-cluster correlation of 0.25, and the proportion of individual variance explained by covariates set to 0.10.
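The logic of such a calculation can be sketched with the standard two-group minimum-detectable-effect formula. This is a generic sketch, not the calculation we ran: the cluster size `m` is an illustrative assumption (our exact design-effect inputs are not reported here), so the function need not reproduce the 0.12 figure exactly:

```python
import math

def mde(n, p_treat=0.25, alpha=0.05, power=0.80, icc=0.25, m=20, r2=0.10):
    """Approximate minimum detectable effect (in standard deviations) for a
    two-sided test, comparing a fraction p_treat of the sample to the rest."""
    z_alpha = 1.96   # z critical value for alpha = 0.05, two-sided
    z_power = 0.84   # z value for power = 0.80
    deff = 1 + (m - 1) * icc          # design effect for clustering, cluster size m
    var = (1 - r2) * deff / (p_treat * (1 - p_treat) * n)
    return (z_alpha + z_power) * math.sqrt(var)
```

The formula makes the usual comparative statics transparent: the MDE shrinks with a larger sample, a higher covariate R-squared, or a lower intra-cluster correlation.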

B.2 Randomization protocols
For the therapy and cash randomization, men in each block took turns drawing colored chips from an opaque fabric bag. In general, the bag was shaken and then the subject was instructed to turn away, place one arm into the bag, and draw out a single chip. The color was confirmed and recorded.
In the cash instance, men were randomized in roughly equal sized blocks of about 50 people. Each man was invited into a private room to draw to ensure privacy and safety. This procedure was explained to the entire group, and all chips were placed into the bag in front of everyone. Then the bag was taken into a private room, and participants were called into the room individually. If they wished, they could inspect the bag to confirm that there were still chips of both colors inside. After everyone present had drawn, staff drew the remaining chips for the no-shows.
In the case of therapy, men were randomized each day, according to how many were recruited and surveyed in that neighborhood. This led to blocks ranging in size from 1 to 20, though the vast majority of blocks contained roughly 7 to 15 people. The draw was not as private as the cash draw, and men observed the outcomes of others drawing at the same time. Those who lost in the therapy randomization were offered a free meal along with the opportunity to discuss their situation with someone, and they were transported to a location of their choosing. A small percentage of the men were visibly upset and refused to engage at this point.
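The physical chip draws implement, in effect, a blocked randomization with a fixed number of "winning" chips per block. A software equivalent might look like the following sketch (our own illustration, not the study's protocol; the function name and seed are assumptions):

```python
import random

def draw_chips(block_ids, n_treated, seed=1):
    """Randomly assign exactly n_treated members of a block to treatment,
    mimicking a bag holding a fixed mix of colored chips."""
    rng = random.Random(seed)
    chips = ["cash"] * n_treated + ["control"] * (len(block_ids) - n_treated)
    rng.shuffle(chips)                      # shake the bag
    return dict(zip(block_ids, chips))      # each person draws one chip

# One block of roughly 50 men, half assigned to the cash treatment
assignment = draw_chips(list(range(50)), n_treated=25)
```

Like the physical procedure, this guarantees the exact treatment count within each block, while each individual's draw remains random.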

B.3 Therapy
NEPI's standard curriculum tended to be longer and broader than the two noncognitive skill and value changes that we study. For the purposes of this study, we worked with NEPI to streamline and focus the traditional STYL curriculum in two ways. First, we further grounded the approach in CBT, emphasizing practice over lectures. In general these modifications were quite modest, since the program already incorporated these techniques. Second, we asked NEPI to exclude modules not relevant to these theories of change: interpersonal skills; conflict resolution skills; dealing with war trauma and PTSD; career counseling; and community leadership.
To clarify and validate NEPI's curriculum, a Liberian qualitative researcher acted as a participant observer throughout one of the two Phase 1 pilot classes. Based on NEPI's training materials, our analysis of the theoretical grounding of the therapy, and this participant observation, we and NEPI developed a full program manual for the intervention. The manual details the history and theory of the interventions, guidelines for recruitment of trainers and participants, training suggestions, the full curriculum, and guidelines for out-of-classroom engagement.

Curriculum
The curriculum has eleven main modules, which we present here with some examples of goals and activities: 1. Transformation. A tenet of CBT is that the therapist explicitly sets goals with participants and lays out the therapeutic strategy. This module introduces the concept of transformation, its significance, and the processes involved in transforming oneself.
• The men are introduced to the techniques that will be used (role playing, lectures, storytelling, etc.), homework assignments, home visits, and the reasons for each.
• The module also introduces ground rules for behavior, in terms of being respectful, practicing listening, waiting your turn, etc. The men do not necessarily have these skills, or haven't exercised them in some time, and learning to abide by these behavioral rules is an important part of the therapy.
• Facilitators also begin to teach the songs, slogans, and call-and-response that will be used repeatedly throughout the course. These songs and slogans serve as important reminders of rules of behavior for the men to follow. They also can be used to bring order to a disorderly or inattentive group.
• There are symbolic rituals to indicate a break in their lives. For example, the men write their "street names" and aliases on sheets of paper and they are burned together.
2. Substance Abuse. This module defines substance abuse and discusses its ill effects, as well as steps for moving past it. It explicitly encourages participants to reduce their consumption of drugs, alcohol, and tobacco. They are cautioned against cutting out drugs entirely, to avoid withdrawal problems.
• Men talk through and list reasons that they use drugs. The idea is to make them consciously aware of the reasons for their own behavior and the risk factors in their lives. They also talk through the ill effects, speaking publicly about ways in which drugs have adversely impacted their own lives and sharing experiences.
• Men role play situations where they could be pressured to use drugs and practice strategies for saying no.
• An outside speaker comes to the classroom, often a former graduate of the therapy, to talk about their experiences with drugs and what it did to their lives, as well as what strategies they used to emerge. Men discuss strategies they can use in their own lives. They practice some of these as homework and come back to discuss their experiences with the class.
3. Body Cleanliness. The module explores the health, psychological, and social benefits of maintaining body cleanliness. Participants are encouraged to change behaviors that alienate them, and to present a public image (such as hair and dress) that promotes positive social interactions with community members.
• Body uncleanliness is defined and highlighted as a problem mainly by getting men to discuss and volunteer their own opinions and experiences in a group.
• The facilitators bring in a hair cutter, an electric shaver, and a set of nail clippers for men to clean up if they like.

4. Garbage/Dirt Control. An extension of the previous module, this module highlights the importance of cleanliness in participants' environments, and the ill effects of living in a dirty environment. It aims to help them maintain clean, healthy, and orderly living spaces.
• Facilitators present the men with pictures of dirty and clean homes, businesses, and streets, and men point out different risks and unclean elements, and discuss the consequences.
• Men identify ways they can improve cleanliness where they live (e.g. get a garbage can) and set and execute these plans as homework, to be followed up with home visits.
5. Anger Management. This module discusses the causes and effects of anger, and the problems with acting out in ways participants may later regret. It also provides participants with tools to manage their anger.
• Men discuss the signs and indications of anger, in themselves and others, through discussion and role playing. Facilitators show pictures of angry faces and situations, and men interpret them. The aim is to make them cognizant of these signs.
• Men discuss the causes of anger, and learn to link some of their actions to other people's anger.
• Men discuss and role play the negative consequences of aggression and violence, or share experiences from their own life.
• Men practice nonaggressive responses to angry confrontations in class, such as learning to distract or calm oneself (walking away, doing other activities, starting discussions and de-escalating, or practicing breathing techniques). Men practice these techniques as homework.
6. Self-Esteem. This module emphasizes the need for participants to discover themselves in order to begin the path to recovery. This module links their behavioral changes to respect, pride, and confidence.
• The facilitators try to link poor self-image directly to many of the behaviors they have discouraged in previous modules, both as a cause and consequence.
• Men discuss ways they can build self-esteem, make plans, and execute them as homework.
• Facilitators work with men to identify skills and characteristics they hold that are worthy of others' respect.
• Men practice shopping for goods in a supermarket or shop as one of the first exposure activities. They work through successes and failures as a group and try again, sometimes with the help of a facilitator.

7. Planning. This module reviews the steps and components necessary for planning and implementation. The goal is to build participants' capacity to develop short- and long-term plans and understand the processes involved in executing these plans.
• Planning skills are commonly taught in CBT programs as a method to build new skills. At its most basic, this involves helping the men break down larger plans into smaller steps and helping them work through ways to accomplish those steps, positively reinforcing successes and helping them process challenges and setbacks, often as a group. Men give examples and discuss them together. Another example: Small groups of men are tasked with organizing activities, such as a football match. The larger group listens to the different plans and critiques them.
• As homework assignments, initially men are tasked with simple tasks (create a short term survival plan for feeding yourself or your family), and then more complex tasks (such as a business plan or home garden).
• Men are also tasked with identifying a successful friend or family member and determining what steps led to their success. A motivational speaker (usually a past graduate) is also invited to talk about the steps involved in their success and their learnings and setbacks.
8. Goal Setting. The module outlines tools participants can use to develop goals, objectives, and indicators for measuring success in their own lives.
• Participants are taught what short- and long-term goals are (through discussion and examples) and how to set reasonable short- and long-term goals (such as feeding their family, or starting a garden).
• First participants practice setting goals and making plans, and then the larger group discusses and critiques them. Participants then set their own small, short term goals (e.g. changing a behavior, reconciling with a family member, or saving a certain amount this week) and execute these as homework, processing successes and failures as a group.
• Participants discuss the characteristics of good goals (e.g. achievable, measurable, timebound) and revise goals and plans. They are given poor goals as a group and practice turning them into better goals. Another motivational speaker is used to discuss the role of goal setting in their own life.
9. Money Business. This module stresses the importance of positive spending habits and appropriately managing money. The dangers of impulsive spending are emphasized. Participants are taught to make plans and prioritize their needs and wants before spending their money.
• Men engage in exercises to track their own recent spending to see where their money has gone. They discuss the use and misuse of their own money. As a group they discuss regrets and bad decisions and work through the negative consequences. These are illustrated dramatically through role-playing and skits, followed by discussion.
• Later discussion, role playing and skits focus on techniques for resisting peer pressure and temptation. There is also testimony from a motivational speaker, usually a past graduate of the program.
10. Money Saving. The module introduces participants to various saving options and encourages them to reflect on the most suitable saving method for their lives. They practice interactions in informal and formal financial institutions.
• Men discuss the reasons for and advantages of saving and it is explicitly linked to positive self image and esteem in the community. There is another motivational speaker.
• Men learn techniques for saving safely at home without formal institutions. They learn to set and execute saving plans, using their goal setting and planning skills.
• Homework assignments involve saving money they would have otherwise used on things they regret (identified in the previous module). Homework also involves trips to the bank and informal lenders. Prior to these assignments they meet and role play in groups, and their strategies are discussed and critiqued by the larger group. There is also a focus on appropriate presentation and image in these outings.

11. Challenges and Setbacks. The module explores potential challenges and setbacks participants will face and has them practice the positive coping mechanisms needed to overcome them effectively. Challenges and setbacks are framed as a test of one's maturity, potential, and abilities, and an opportunity for improvement.

A note on the approach
Note that in the United States, cognitive behavioral approaches to reducing violence are conscious of the fact that the values and behaviors they encourage could be maladaptive in some situations, since being violent can also protect people. As a result, these therapies teach people to judge when and where to use aggression. NEPI, in designing the STYL therapy, did not consider the need for educating men on such contingent, adaptive behavior. Rather, their philosophy was that fighting back or retaliating in this context would lead to cycles of violence and an escalation of future risk, not a decrease. NEPI also emphasized that it was important for the men who passed through STYL to demonstrate to the community that they were not aggressors or violent, in order to maintain their new image, and that retaliation could be counterproductive there.

B.4 Cash grants
We contracted the international non-profit Global Communities (GC) to conduct the registration and cash distribution, as well as oversee NEPI's financial management and implementation schedule. We did so for several reasons: 1. To keep the therapy and the research teams distinct from cash distribution; 2. To coordinate registration and implementation of the two activities; 3. To relieve the research team of project and financial management of the interventions; and 4. To make the intervention as close as possible to a real-world, replicable intervention by other non-profit or state organizations.
For safety, GC developed a highly structured system of cash distribution. GC staff held cash in a car that moved around the neighborhood, to avoid theft. A lottery team working with the men gave grant winners a voucher and put them on a motorbike taxi that was then directed to the street corner where the car with the cash awaited. They were told to approach the car (which had an identifying mark such as a red bag on the dash), hand over their voucher, and receive their cash. The car would then move to a new corner, whose location would be relayed by mobile phone, and the process would repeat.
Anyone who was assigned to the cash treatment but was not present on the day of disbursal was still eligible for the grant. GC attempted to locate them for up to three weeks afterward, and generally succeeded.

C Formal theoretical model
Our model is rooted in previous models of occupational choice with self-employment (Fafchamps et al., 2014; Udry, 2010; Blattman et al., 2014), but adapted to include a criminal sector as in the broad class of models described by Draca and Machin (2015). We employed a similar model in Blattman and Annan (2015).

C.1 Setup
We model an individual's choice between legitimate business and illicit activities under different conditions-with and without time inconsistency, and with and without financial market imperfections-and assess the predictions for a number of common labor market and crime-reducing interventions: greater punishment, increasing productivity in legitimate business (e.g. through technology or skills improvement), cash or capital transfers, and interventions that shape preferences-either time preferences or personal preferences against illegal behavior.
We use $L^b$ and $L^c$ to denote time spent in legitimate activities (such as petty business) and illegitimate activities (such as crime). Legitimate business produces revenue according to production function $F(\theta, L^b_t, K_t)$, where $\theta$ is productivity or individual ability and $K$ is accumulated capital used in business. A person's decision to participate in illegal activity is motivated by the potential gains and costs from such activity. Gains include the expected illegitimate payoff per hour spent in illegal activities, $w$. Costs include the possibility of apprehension and conviction, which occurs with probability $\rho$ and implies a penalty $f L^c_{t-1}$. Thus the penalty for criminal behavior is a linear function of hours spent in criminal activities in the previous period. The individual's total expected earnings from legitimate and illegitimate activities are $F(\theta, L^b_t, K_t) + w_t L^c_t - \rho f L^c_{t-1}$. In addition to investing in business, the individual can also invest or borrow through a riskless asset with constant returns $1 + r$. At each period $t$, the individual decides how much to invest for the next period, $a_{t+1}$, and reaps interest $r a_t$ from the last period's investment.
Individuals have utility function $U(c, l, \sigma L^c)$, where $c$ denotes consumption and $l$ denotes time for leisure. We also allow for individuals to have direct disutility from engaging in crime, as measured by $\sigma L^c$, where $\sigma > 0$ implies that illicit work induces some internal penalty such as shame, though in principle it could also reflect social penalties such as a loss of esteem or exclusion from peers and other social networks. We make the standard assumptions that $U_c \geq 0$, $U_l \geq 0$, and $U_{\sigma L^c} \leq 0$. We allow for the individual to have quasi-hyperbolic $(\beta, \delta)$ preferences.
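Quasi-hyperbolic $(\beta, \delta)$ discounting applies an extra factor $\beta$ to all future periods, so the implied discount factor between today and tomorrow ($\beta\delta$) is smaller than between any two future periods ($\delta$). A minimal illustration of these discount weights (our own sketch, with illustrative parameter values):

```python
def qh_weights(beta, delta, horizon):
    """Discount weights under quasi-hyperbolic (beta, delta) preferences:
    weight 1 on today, beta * delta**t on each future period t."""
    return [1.0] + [beta * delta ** t for t in range(1, horizon)]

w = qh_weights(beta=0.7, delta=0.95, horizon=4)
# Present bias: the one-period discount factor applied today (w[1]/w[0] = beta*delta)
# is smaller than the one applied between future periods (w[2]/w[1] = delta).
```

This wedge between short-run and long-run discounting is what generates the time inconsistency analyzed below.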
We first consider the case without any uncertainty. The individual's problem is to choose consumption, labor in each sector, capital, and savings $\{c_t, L^b_t, L^c_t, K_{t+1}, a_{t+1}\}$ to maximize $U(c_0, l_0, \sigma L^c_0) + \beta \sum_{t \geq 1} \delta^t U(c_t, l_t, \sigma L^c_t)$, subject to the period budget constraint that consumption plus investment in capital and the riskless asset not exceed earnings $F(\theta, L^b_t, K_t) + w_t L^c_t - \rho f L^c_{t-1}$ plus asset returns, and subject to the time endowment being divided between $L^b_t$, $L^c_t$, and leisure $l_t$.

Without credit constraints
Without time inconsistency ($\beta = 1$) or credit constraints, the optimality conditions are as follows, where for ease of notation we use $U(t)$ to denote $U(c_t, l_t, \sigma L^c_t)$ and $F(t)$ to denote $F(\theta, L^b_t, K_t)$. Since we modeled crime punishment as a potential reduction in future wages, the risk-neutral individual will view crime as an occupation with a discounted wage $w_t - \frac{\rho f}{1+r}$. To find the marginal conditions for engaging in each sector, we first consider the case where illicit activity is not feasible. This would arise naturally if the probability of apprehension is high enough and punishment is heavy enough that $w < \frac{\rho f}{1+r}$. In this case the decision to engage in business depends on productivity $\theta$, wealth level, and the returns on other financial assets $r$. We use $c^{ba}$, $L^{ba}$ and $K^{ba}$ to denote consumption, labor, and capital in this scenario. Each period $t$, the individual chooses $L^{ba}_t$ to satisfy the intratemporal optimality condition, taking $K^{ba}_t$ as given, and he chooses capital investment $K^{ba}_{t+1}$ to satisfy $F_K(\theta, L^{ba}_{t+1}, K^{ba}_{t+1}) = 1 + r$, taking expected $L^{ba}_{t+1}$ as given. (For ease of analysis, we also assume that the marginal return to capital is infinite for the first unit of capital invested in business, and that as long as there is positive capital input, the marginal product of labor for the first unit of labor is infinite, i.e. $\lim F_K(\theta, L^b, K) = +\infty$ as long as $K > 0$. This assumption guarantees that investments and hours in business will always be positive.) Now, taking the levels of $c^{ba}$, $L^{ba}$ and $K^{ba}$ as given, we look at individuals' decision to engage in crime. Individuals will engage in illicit activities if and only if condition (6) holds, which says that expected returns from crime are higher than the highest possible marginal rate of substitution between leisure and consumption the individual can achieve without engaging in crime.
Since $-U_{\sigma L^c}/U_c > 0$, a rise in $\sigma$ means more people will drop out of crime.
If condition (6) is satisfied and if $K_t > 0$, the individual then chooses $L^b_t$ and $L^c_t$ such that the marginal product of labor in business equals his expected marginal gains from crime, which also equals his marginal rate of substitution between leisure and consumption; i.e., conditions (1) and (2) will be satisfied. Notice $L^c_t$ may not always be positive. The individual will not engage in crime if any of the following holds: $w_t$ is very low relative to the probability of apprehension $\rho$ and punishment $f$; productivity in business $\theta$ is very high; or the degree of aversion to crime $\sigma$ is very high.
Capital investment and hours in business will satisfy condition (3). Notice that $w$, $\rho$ and $f$ will not affect returns to investment in business.
Interventions that increase the disutility of crime or the size or probability of punishment will reduce time devoted to crime, but will have no effect on returns in business. However, interventions that increase business productivity $\theta$ will not only induce more investment in business, but also reduce involvement in crime. In other words, $\partial L^c/\partial \sigma < 0$, $\partial L^b/\partial \sigma$ is ambiguous, $\partial L^c/\partial \theta < 0$ and $\partial L^b/\partial \theta > 0$. Finally, interventions that provide capital or liquid financial assets, such as a cash windfall, will not affect occupational choice at all, since the individual will already be working at his optimal level in both sectors. The windfall will simply be consumed and saved.
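These comparative statics can be illustrated numerically. The sketch below solves a one-period version of the allocation problem on a grid; the log utility, Cobb-Douglas production, and all parameter values are illustrative assumptions of ours, not the model's calibration:

```python
import math

def best_allocation(theta, w_net=0.6, sigma=0.5, K=1.0, alpha=0.5):
    """Grid-search the static choice of business hours Lb and crime hours Lc.
    Utility: log(c) + alpha*log(leisure) - sigma*Lc, with
    c = theta*sqrt(Lb*K) + w_net*Lc and a unit time budget."""
    grid = [i / 50 for i in range(50)]
    best_u, best = -float("inf"), (0.0, 0.0)
    for lb in grid:
        for lc in grid:
            leisure = 1.0 - lb - lc
            if leisure <= 0.02:
                continue
            c = theta * math.sqrt(lb * K) + w_net * lc + 1e-9
            u = math.log(c) + alpha * math.log(leisure) - sigma * lc
            if u > best_u:
                best_u, best = u, (lb, lc)
    return best  # (business hours, crime hours)

lb_low, lc_low = best_allocation(theta=0.2)    # low business productivity
lb_high, lc_high = best_allocation(theta=2.0)  # high business productivity
# Raising theta shifts hours out of crime and into business, as in the
# comparative statics dLc/dtheta < 0 and dLb/dtheta > 0.
```

The same exercise with a higher `sigma` shows crime hours falling while business hours respond ambiguously, matching the signs derived above.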

With credit constraints
In this section we consider the model with a simple credit constraint in the form of $a_t \geq 0$: individuals are unable to borrow in any period. We focus our attention on individuals whose initial $a_0$ is low enough that at some point in life the credit constraint binds. Credit constraints will alter optimality conditions (2) and (3), for hours in crime and for capital investment. For impatient individuals, with $\frac{1}{\delta} > 1 + r$, the optimal level of capital investment will be lower than in the baseline case because of the credit constraint. (The level of investment in business may change depending on the shape of the utility and production functions, but the returns to investment will not change.) They also have higher expected returns from crime than in the baseline case, because the low level of business investment also forces them to put a higher discount rate on potential future punishment from crime.
Critical condition (6) changes as well: credit constraints induce more individuals who would otherwise not engage in crime to commit crime. For impatient individuals, credit constraints increase their hours in crime and reduce their capital investment and hours in business activities.
Interventions that ease the credit constraint, including cash windfalls, will induce more investment in business and reduce involvement in crime. As in the baseline case, $\partial L^c/\partial \sigma < 0$, $\partial L^b/\partial \sigma$ is ambiguous, $\partial L^c/\partial \theta < 0$ and $\partial L^b/\partial \theta > 0$; however, the magnitude of the effects of a change in $\sigma$ or $\theta$ will be greater than in the baseline case. The magnitudes also increase with the degree of impatience: $\frac{d|\partial L^c/\partial \sigma|}{d\delta} < 0$, $\frac{d|\partial L^c/\partial \theta|}{d\delta} < 0$ and $\frac{d|\partial L^b/\partial \theta|}{d\delta} < 0$ (notice that the lower the value of $\delta$, the more impatient the individual).

Without credit constraints
Time-inconsistent individuals (β < 1) will be more reckless in the present. Intuitively, the smaller is β, the more individuals want to enjoy higher consumption today at the expense of future consumption, which means they will borrow more, save less, invest less in business, and/or engage more in criminal activities. However, as long as there is a perfect financial market, no one will change their business or criminal activities in order to consume more today; they will simply borrow more (or save less) today through the financial market.
In terms of optimal conditions, in the absence of any credit constraint, the only condition that changes is equation (4), where W_t denotes total wealth at time t and c^P_{t+1} denotes the individual's prediction at time t of the future decision c_{t+1}. For the sophisticates c^P_{t+1} = c_{t+1}, while for the naifs c^P_{t+1} > c_{t+1}. Compared with the baseline case, the discount factor δ is replaced by the effective discount factor (∂c_{t+1}/∂W_{t+1})βδ + (1 − ∂c_{t+1}/∂W_{t+1})δ, a weighted average of the short-run and long-run discount factors βδ and δ, where the weights are the next period's marginal propensity to consume out of total wealth. Notice that neither condition (2) nor condition (3) changes, as long as we have no credit constraints. Compared with the baseline, time inconsistency alone will not affect criminal activities or business investment; it would only change the level of savings or debts.
In this case, interventions that aim to correct time inconsistency will have no effect on either business investment or criminal activities, but will have an effect on consumption, savings and income.
With credit constraints

Compared with the baseline case, τ > 1 + r as long as an individual is credit constrained (i.e. has no savings). The level of τ will be higher for the sophisticates than for the naifs. However, regardless of their level of sophistication (i.e. the way individuals set their expectations for their future behavior), we know for sure that τ > 1/δ, and the smaller β is (i.e. the more time inconsistent the individual), the higher τ will be.
Compared to the time-consistent, credit-constrained case, fewer individuals will invest in business, more individuals will engage in crime, business investment levels will be lower, and hours in crime will be higher for everyone. The difference increases with the level of inconsistency (i.e. decreases with β).
Interventions that improve time consistency will shift people away from crime towards business. So will increasing the disutility of crime (though, as in the case without time inconsistency, while ∂L_c/∂σ < 0, ∂L_b/∂σ is ambiguous). Increasing business productivity will have similar effects as before: ∂L_c/∂θ < 0 and ∂L_b/∂θ > 0. In all of these cases, however, the magnitudes of the effects of a change in σ or θ will be greater than under time consistency, and the magnitudes also increase with both the degree of impatience and the degree of time inconsistency. Notice that the lower the value of β, the more time inconsistent the individual is, and similarly, the lower the value of δ, the more impatient the individual is.

C.4 Introducing uncertainty and risk aversion
Three potential sources of risk are uncertainties in business productivity θ, wages from criminal activities w, and the potential punishment after apprehension f . We assume that decisions on business investment and hours in both sectors are made before risks are realized, and that θ, w and f follow independent stochastic processes.
With uncertainties in both the business and illicit sector, business investment and hours in both sectors depend on the variance of returns in both sectors and the level of initial wealth a_0. If both sectors are sufficiently risky, then those with high levels of wealth a_0 will turn away from both activities, reducing K, L_b and L_c and investing instead in other riskless assets; K, L_b and L_c will all be lower than in the cases without risk. Those with low levels of initial wealth will not be able to live off savings alone, so they will have to invest more in either or both sectors, depending on the relative riskiness of the two sectors. As long as both sectors are similarly risky, K, L_b and L_c will all be higher; otherwise, if one of the sectors is less risky than the other, individuals will invest more time in that sector. The share L_c/(L_b + L_c) will be lower than in the case without uncertainty if returns to crime are more volatile than business returns. One special case would be if individuals face a significantly positive chance of death after committing any crime. This is the equivalent of saying f = +∞ with strictly positive probability. In this case hours in crime will be reduced to zero as long as the probability of apprehension is positive, ρ > 0.
With the presence of risk, interventions in θ will have greater effects, because an increase in θ now also makes business relatively less risky. A rise in σ will also have a bigger effect than without uncertainty, because risk aversion reinforces the increased aversion to crime and further reduces hours in crime.

D Measurement
In this section, we discuss measurement decisions in more detail, including what was and was not specified in the 2012 National Science Foundation (NSF) proposal 1225697 that substitutes for the absence of a pre-analysis plan. 9 Section 4 of the proposal provides a numbered list of hypotheses and primary outcomes, and (roughly) how we planned to operationalize them, especially Sections 4.1 and 4.4. Section 5 expands on measurement approaches, both for these primary outcomes, as well as for control variables and other outcomes of interest. These are the key sections to examine now. Section 4 in particular is the basis for our organization of the current paper. That section and the introduction (Section 1) not only emphasize particular primary outcomes, but also the division into ultimate and intermediary outcomes.
We also report control group means and treatment effects on all of the survey questions that enter an index in the main tables. A note of caution: the standard errors have not been adjusted for multiple hypothesis testing, and so patterns across treatment effects within an index are suggestive and illustrative only. Table D.1 displays treatment effects for all components of our antisocial behaviors index.10

D.1 Antisocial behaviors
Sections 1 through 4 of the NSF proposal make the primary, ultimate outcomes fairly clear: "poverty" and "violence", where Section 4.1.C defined "violence" as "crime, aggression, and political violence". As discussed in the main paper, only political violence was later dropped because none occurred before endline. We renamed this collection of outcomes "antisocial behaviors," for generality and clarity.
We are not aware of existing scales or measurement tools for Liberia, or even similar populations in sub-Saharan Africa or other low-income countries. Thus, in general, our variables grew out of months of field work, qualitative interviews, and survey pre-testing by the authors and their research assistants, in order to understand common offenses and behaviors. Liberians speak a pidgin English, and street youth have a slang of their own, so even where we began with common scales (such as aggressive behaviors) the wording had to undergo extensive translation and testing to make sense. We also added new aggressive behaviors common to the study population and Liberian culture.

D.2 Economic performance

Table 3 of the main paper reports all measures of economic performance, and we do not replicate it here. The NSF proposal emphasized "poverty" as one of the two ultimate outcomes of interest, and in Section 4.1 expanded on this to discuss the expected impacts of the therapy on "economic decisionmaking and outcomes", including "levels of business investment and expenditures, savings, income and assets/consumption". The table in that proposal focused on business investments and our three measures of income (consumption, earnings, and asset stock). We look at all these measures of economic performance in a single family index, but would draw the same conclusions if we took a narrower definition of poverty and focused on the income measures alone, or even consumption alone.

Table D.2 displays control means and 12-13-month treatment effects for all subcomponents of our forward-looking time preferences index. The summary index consists of eight equally-weighted components: four measures of patience (δ) and four measures of time inconsistency (β). Components come from incentivized game play, hypothetical trade-offs over time, and survey measures.

D.3 Time preferences
The NSF proposal outlined these measures fairly specifically. Section 4.1.A specified our interest in the malleability of present versus future orientation and time inconsistency, and Section 4.4.A operationalized these measures as incentivized intertemporal choice games; hypothetical intertemporal choice games; and self-reported preferences. The main source of ambiguity was that the proposal referred to these measures variously as "discount rates", "present bias" or "forward-looking behavior".
In the end, the survey and incentivized games collected four types of measures, and each one yields a proxy of patience and time inconsistency.

Incentivized trade-offs
Following the survey, subjects were asked to play a set of "real money games" in which they made a series of intertemporal choices between money at one point in time versus more money later, with some probability of a payout. The average payout was about $3, roughly a day's wages.11 The first choice was between money now and more money in two weeks; the second between money in two weeks and in four weeks; and finally one more question for each of these pairs of delays, but with the numbers modified depending on the first answer (i.e. if they chose to wait, they were asked again with a lower reward in the future). This bifurcating design allowed us to glean as much information as possible about their preferences with as few questions as possible, and we pretested the potential payouts to maximize the variance in responses.
Based on game play, we assigned present and future patience scores for each respondent, ranging from 0 (less patient) to 3 (more patient). 12 We then used the sum of patience scores from the games to put people into 7 increasingly patient bins (0-6), and the difference of scores to put people into 7 increasingly time inconsistent bins.
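The scoring and binning described above can be sketched as follows. This is an illustrative sketch only: the exact mapping from the two bifurcated answers to a 0-3 score, and the direction of the difference used for the inconsistency bins, are our assumptions, not the authors' exact coding.

```python
# Hypothetical sketch of the patience scoring and binning; the 0-3 mapping
# and the sign convention for inconsistency are assumptions.

def bifurcation_score(chose_wait_first, chose_wait_second):
    """Map the two answers in one bifurcation (now vs. two weeks, or two vs.
    four weeks) to a 0 (least patient) to 3 (most patient) score."""
    if chose_wait_first:
        return 3 if chose_wait_second else 2  # waited even at a lower future reward?
    return 1 if chose_wait_second else 0      # waited once the future reward rose?

def patience_bins(present_score, future_score):
    """Sum of the two 0-3 scores gives 7 patience bins (0-6); the shifted
    difference gives 7 time-inconsistency bins (0-6)."""
    assert 0 <= present_score <= 3 and 0 <= future_score <= 3
    patience = present_score + future_score            # 0..6
    inconsistency = (future_score - present_score) + 3  # shift -3..3 into 0..6
    return patience, inconsistency
```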

Hypothetical trade-offs
During the survey questionnaire, well before the incentivized games, we asked respondents to make the exact same series of trade-offs as above, but in a purely hypothetical setting. We constructed the patience and time inconsistency proxies in exactly the same manner. Our aim was largely methodological, as we were interested in whether people responded differently when games were incentivized rather than hypothetical. This analysis, comparing the consistency and comparability of time preferences across different measures and over time, will be the subject of future methodological work, based on similar data we have collected across several countries and populations.13 In the meantime, we merely use all available time preference measures in our summary index, in the interest of reporting all survey measures used from each family.

Hypothetical discount rate
We also attempted to measure the discount rate in a second way (again, mainly for the methodological study mentioned above). As in Holt and Laury (2002), we asked respondents a series of hypothetical inter-temporal choices for larger amounts of money (on the order of US$10-30, about a week's wages). This was organized as two lists of 11 binary decisions, with a fixed amount right now versus a varying amount in two weeks (or two weeks versus four weeks for the second list). The delayed amount started as strictly less than the sooner amount (e.g. 1000 LD now or 900 in the future), then equal to, and then larger and larger until it was four times as big (1000 LD now or 4000 LD in future).
We calculated discount rates based on each respondent's first switch from a present preference to a future preference.14 Those who preferred 900 LD in the future over 1000 LD in the present received a discount rate of .9, while those who always preferred money earlier received a discount rate of 4. We then took the average of the inverse of the present (now versus 2 weeks) and future (in 2 weeks versus 4 weeks) discount rates as our measure of patience, and the difference between the future and present discount rates as our measure of time inconsistency.

11 … the survey team would return. By the endline stage (their fifth survey with us), respondents knew us fairly well and knew that we were able to track them (and that we had paid them everything we had promised them in the past). In fact, for logistical reasons, we also made one of the games a choice between a certain payout now and a lottery between a high and low payout (i.e. a risk preference question), and we selected this risk game for payout with very high probability, such that the intertemporal games were almost never paid out. Although we did not technically lie at any point (since we did not mention the probabilities that each task would be paid out), this could be construed as minor deception. None of the respondents brought this up, even after having gone through the process five times.

12 For example, if a respondent preferred 150 Liberian dollars (or LD, where 1 USD = 60 LD at the time) in a week over 50 LD now, and 100 LD in a week over 50 LD now, they received a 3 for their present patience score. If they preferred 50 LD in two weeks over 150 LD in three weeks, and 50 LD in two weeks over 300 LD in three weeks, they received a 0 for their future patience score.

13 In the meantime, we can see that the means are similar (3.96 for the incentivized game versus 3.35 for the hypothetical), but this 15% difference is statistically significant at the 99% level.

14 Enumerators continued down the list, and (oddly) a nontrivial fraction switched multiple times. We use the first switch only. Furthermore, about 17% of respondents preferred less money in the future as a commitment device, especially if they were expecting a large purchase soon.
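The first-switch calculation can be sketched as below. The endpoints (900 and 4000 LD against 1000 LD sooner) come from the text; the intermediate delayed amounts in the list are our assumption, as are the helper names.

```python
# Sketch of the first-switch discount-rate calculation from the 11-row price
# list. Only the 900 and 4000 LD endpoints are given in the text; the
# intermediate amounts below are illustrative assumptions.

DELAYED = [900, 1000, 1250, 1500, 1750, 2000, 2250, 2500, 3000, 3500, 4000]

def implied_discount_rate(chose_later, sooner=1000, delayed=DELAYED):
    """chose_later: one boolean per row of the price list (True = took the
    delayed amount). The rate is the first delayed amount accepted, relative
    to the sooner amount; never switching yields the maximum rate of 4."""
    for later, amount in zip(chose_later, delayed):
        if later:
            return amount / sooner
    return delayed[-1] / sooner

def patience_measure(rate_present, rate_future):
    # Average of the inverse present and future discount rates.
    return (1 / rate_present + 1 / rate_future) / 2

def inconsistency_measure(rate_present, rate_future):
    # Difference between the future and present discount rates.
    return rate_future - rate_present
```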

Self-reported survey questions
We asked respondents six qualitative questions to gauge their self-reported levels of patience and time inconsistency.15 For example, respondents were asked to place themselves on a ladder from 0 (least patient) to 5 (most patient) as one measure of self-reported patience, and how much they agreed with statements such as "When I get money, I spend it quickly" as a proxy of time inconsistency. Specific questions are displayed in Table D.2.

By reporting all measures collected in the endline survey, three-quarters of our time preference measures are hypothetical rather than based on incentivized games. For robustness purposes, in Table D.2 we also report a summary index of the incentivized games only.

Table D.3 displays control means and 12-13-month treatment effects for all subcomponents and survey questions in our self-control index. Again, these treatment effects are for illustrative purposes only.

D.4 Self control skills
The NSF proposal highlighted "noncognitive skills of self control" as one of the three primary intermediate outcomes of interest in Section 4.1, though the proposal sometimes used "impulse control" or "self-discipline" synonymously. Sections 4.1.A and B gave examples such as "inhibition control, executive function, and perseverance". Section 4.4.A added that we would measure this using "standard psychological skills such as conscientiousness, locus of control, working memory, and inhibition." This ex ante description included executive function and locus of control, neither of which we presently include in the family of self-control measures in the paper. We decided to exclude these two measures after the NSF proposal was written but before final data collection. While this decision is not formally documented, the psychological and neurological principles are clear.
• While economists often refer to executive function as a "noncognitive" skill, one associated with self control, it is technically a cognitive ability, in that psychologists and neuroscientists view it as a measure of mental performance established at a young age. Psychological and neuroscientific research suggests that executive function responds to childhood but not adult investments, and that investments result in very task-specific changes. Thus both theory and evidence suggest that this neurological capability should not be affected by CBT. Appendix D.7 below describes this research in more detail and illustrates the consequences of including or excluding executive function from the self control family.
• Locus of control or self-efficacy should never have been mentioned in the NSF proposal as linked to self control; indeed, the one mention of it was an aberration and an error. The concept of self control is intended to measure the degree to which you feel you are able to control your own emotions and behavior, versus the degree to which you find you act impulsively or as your emotions dictate, without restraint. It is a skill. Locus of control, meanwhile, is a perception: a measure of the degree to which a person feels that events can be influenced by their own behavior versus luck or fate. It is possible that self control skills could affect a person's perceptions of self-efficacy. Even so, these two concepts are considered distinct, even unrelated, by psychologists, who view locus of control as a measure of self-regard and often combine it with measures such as neuroticism and self-esteem (Judge et al., 2002; Judge and Bono, 2001).
Instead, the survey included four psychological scales: impulsiveness, conscientiousness, GRIT, and reward responsiveness.
These existing scales typically have many more questions than we could use in the survey (or than are commonly used in any single assessment). The questions are typically organized into sub-scales to capture subcategories of behavior. We selected questions to use based mainly on whether they were easily understood and familiar to pre-test respondents, but we took care to ensure roughly equal proportions of questions from each sub-scale remained.
Because all personality questions were selected from questionnaires used in the United States, they were first translated into Liberian English by the enumerators; the authors and their research assistants then pre-tested the questions with young men from the same population as the youth in our study (but not members of the study sample).
To ensure that the questions continued to assess the original underlying constructs, we performed two checks. First, within the pre-test data we ensured that groups of questions were correlated or anti-correlated as one would expect given the underlying personality measure (e.g., impulsivity was negatively correlated with conscientiousness). Second, we performed a confirmatory factor analysis to ensure that, within scales, questions were answered similarly.
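The first check, the sign of the inter-scale correlation, can be sketched as follows. The toy subscale sums are hypothetical, not study data; the function name is ours.

```python
# Minimal sketch of the sign check on inter-scale correlations; the data
# below are invented per-respondent subscale sums, not the study's pre-test data.
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation (population SDs), enough for a sign check."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

impulsivity = [5, 4, 3, 2, 1]        # hypothetical subscale sums
conscientiousness = [1, 2, 3, 4, 5]  # hypothetical subscale sums

# The expected anti-correlation between impulsivity and conscientiousness:
assert pearson(impulsivity, conscientiousness) < 0
```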

D.5 Anti-criminal and anti-violent self-image/values
We are not aware of prior attempts to conceptualize or measure anti-criminal and anti-violent values and self-image. We developed three measures: self-reported values, prosocial behaviors, and appearance. The first two of these were prespecified in the NSF proposal.
Values and self-image were operationalized in Section 4.4.B of the NSF proposal, as the main "threat" to identifying the operation of the therapy through time preferences and self control skills. We recognized that the therapy could "change norms or beliefs about violence and its acceptability and risks and thereby reduce violence not through an effect on time preferences and self-control but through norms and the intrinsic utility or disutility of violent action." As a result, the proposal suggested that "we should observe a treatment effect of the [therapy] on self-reported norms towards violence, criminality and other antisocial behaviors, and possibly an increase in forms of collective actions, such as contributions to public goods or political participation." As we finalized the intervention, saw the pilot results, and designed the endline survey, we realized this was not a "threat" but interesting in itself, and simply another intermediary mechanism. In the paper, we chose to focus on the direct measure of preferences as opposed to the prosocial behaviors (which were actions, not preferences, and where by using the word "possibly" we indicated that we did not have strong priors).
As an afterthought, we also happened to measure the enumerator's impression of the respondent's appearance. We did not conceive of these as measures of an actual skill or preference change, as they are choices or actions. Hence we treated them as "other" outcomes in the last version of the paper. One of the referees argued persuasively that our measure of self-image should be changed ex post to include this measure of appearance. We agree and have made that change in the revised paper.

Self-reported values
The closest parallel to our measure of values is the measurement of social norms, where social psychologists ask respondents: (1) what the respondent thinks other people do (descriptive norms); (2) what the respondent thinks other people believe is appropriate (prescriptive norms); and, in some cases, (3) what the respondent him or herself believes is appropriate (an attitude) (Paluck, 2009). We used social norm surveys on behaviors such as bullying and conflict resolution as models for our approach, but had to develop our own original measures suitable to the context and treatment.

Prosocial behavior and appearance
From qualitative interviews (and prior surveys in the country) we also developed a number of locally-relevant prosocial behavior and appearance measures.

D.6 Additional intermediary outcomes of interest: Mental health, substance abuse, and social networks

Sections 1 to 4 of the NSF proposal made no mention of mental health, substance abuse, or social network change as major hypotheses or intermediary outcomes. Nonetheless, many of these were highlighted in Section 5 of the proposal as control or other outcomes of interest, as it is conceivable that they could be affected by the interventions, and that any change in these could also affect poverty or antisocial behavior. Table D.5 displays the control group means and 12-13-month impacts of all the survey questions that comprise these three families. We describe the composition of each of these measures in Section 6.3 of the main paper.

Table D.6 displays the control group means and 12-13-month impacts of all the components of executive function. We decided to measure executive function over time out of other research interests in the measurement and consistency of executive function. We did not hypothesize a change in executive function, for reasons noted above, and so these results are not reported in the paper. This is the only surveyed outcome not reported in the paper's main tables. In the following section, we show the robustness of results to its inclusion.

D.7 Executive function
In order to measure executive function, our behavioral protocol included three interactive activities drawn from economics and psychology.16

Planning behaviors

We used a series of mazes to test planning behavior. Mazes were unknown to nearly all respondents. Subjects were shown an example maze on paper and then given 2, 2, and 3 minutes respectively to complete increasingly difficult mazes. Each had two entry points, one of which almost immediately led to a dead end. The main outcome of the mazes was the subject's ability to pause and plan their approach before completing the maze (i.e. did they plan their approach before choosing a starting point). As outcomes, we measure "time to first touch", or the amount of time spent planning prior to engaging in the maze, and the number of mistakes (or "backtracks") in Maze 3, the hardest maze, which required the most planning and by which time participants had learned the concept of the maze. On average subjects took 18 seconds to plan for Maze 3 (SD = 23 seconds).
Behavioral inhibition and cognitive flexibility

We developed the "arrows game", a modified directional Stroop task, a class of tasks that assess inhibitory control. Subjects were shown a sequence of large black or white arrows that pointed either up or down, and were first told to respond "up" or "down" to each arrow ("arrows baseline"). In the second version they were again shown the arrows but were told to state the opposite direction; this requires producing the less common response while suppressing the more common response, and is an assessment of inhibition ("arrows inhibition"). Finally, in a third version subjects were told to switch between two approaches: if the arrow was white they were to state the actual direction, but the opposite direction if the arrow was black. This is commonly called 'switching' and is an assessment of cognitive flexibility, the ability to move rapidly between two goals as the situation demands ("arrows switching"). For each version, the outcome data included total time to completion and the number of correct/incorrect responses out of 32 arrows. On average subjects made .33 errors (SD = 1.5) on arrows baseline, 2.4 errors (SD = 3.5) on arrows inhibition, and 3.9 errors (SD = 3.9) on arrows switching. Arrows took on average 25 seconds (SD = 17.7), 38 seconds (SD = 45.8), and 46 seconds (SD = 28.7) for baseline, inhibition, and switching respectively.

16 … enumerators in test administration. Next, in collaboration with experienced enumerators and research assistants, a comprehensive protocol was developed and used by all future enumerators. Enumerators were also instructed to answer clarifying questions and were taught the over-arching concept within each game so they could address questions and alleviate concerns without straying from the central concepts of the tests. This tight control over the testing situation allowed us to collect relatively sophisticated measures of cognitive function and behavioral responses to rewards in a constrained and otherwise under-resourced testing environment.
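The three response rules of the arrows game can be sketched as below. The function and condition names are ours, not the study's instrument, and this is only an illustration of the scoring logic.

```python
# Sketch of the response rules for the three arrows conditions; names are
# illustrative assumptions.

def expected_response(direction, color, condition):
    """direction: 'up' or 'down'; color: 'black' or 'white';
    condition: 'baseline', 'inhibition', or 'switching'."""
    flip = {"up": "down", "down": "up"}
    if condition == "baseline":
        return direction               # say what you see
    if condition == "inhibition":
        return flip[direction]         # always say the opposite
    # switching: actual direction for white arrows, opposite for black
    return direction if color == "white" else flip[direction]

def count_errors(trials, condition):
    """trials: list of (direction, color, response) tuples; counts responses
    that do not match the rule for the given condition."""
    return sum(1 for d, c, r in trials
               if r != expected_response(d, c, condition))
```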
Working memory

Working memory is the ability to hold something in mind when it is no longer present in the environment and then manipulate it. The digit span task is an assessment of working memory. It involved the enumerator saying a random sequence of digits (1-9) out loud with a short pause between each digit, followed by the respondent repeating them back either in the same (forward digits) or the reverse (backward digits) order. The enumerator began by giving two 2-digit numbers (one at a time) and recording the responses. If the subject correctly reported either of the numbers back, the enumerator would do the same with 3-digit numbers, and so on up to a maximum of 9 digits. As soon as the subject incorrectly reported both examples at a given level or span, the enumerator moved on to the next activity (backward digits). The reverse digit span was done the same way, except that the subject was instructed to repeat the digits in the opposite order that the enumerator gave them (e.g., "three, zero, one"). On average subjects were able to remember 5.5 digits forward (SD = 1.23) and 3.33 digits backwards (SD = 1.03). Each activity existed as two slight variants (e.g. changing the numbers in the gambles). These activities were alternated between the 2- and 5-week endlines and between the 12- and 13-month endlines, so that participants were never asked identical questions too close together in time.
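The stopping rule just described can be sketched as follows. Taking the score to be the longest span at which at least one of the two trials was repeated correctly is our assumption about the scoring, not a documented rule.

```python
# Sketch of the digit span stopping rule; the scoring choice (longest span
# with at least one correct trial) is an assumption.

def digit_span_score(trials):
    """trials: dict mapping span length (2, 3, ..., 9) to a (bool, bool) pair
    recording whether each of the two numbers at that span was repeated
    correctly. Returns the longest span passed before two failures."""
    best = 0
    for span in range(2, 10):
        first, second = trials.get(span, (False, False))
        if first or second:
            best = span    # passed: the enumerator moves to the next span
        else:
            break          # failed both trials: the task stops
    return best
```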

D.8 Distinguishing between different measures of "self-control"
Our summary indexes distinguish between self-control skills (assessed by various psychological scales), economic time preferences (using incentivized and hypothetical games), and (as an "other" outcome) executive function. Here we discuss the decision to separate these measures and what happens when we relax that assumption.
First, we treat the difference between time preferences and self control skills as an empirical question. As reported in Section 6.4, they are positively and significantly correlated but with a correlation of 0.33 it is unclear whether they are distinct or not. As we report in Table D.7, combining both into an equally-weighted index leads to large increases in the measure for both the therapy-only group (0.17 SD after 2-5 weeks, 0.18 SD after a year) and therapy and cash group (0.22 SD after 2-5 weeks, 0.26 SD after a year).
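The combined equally-weighted index can be sketched as below: standardize each component and average the z-scores. The helper names are ours; this is a minimal illustration of equal weighting, not the paper's estimation code.

```python
# Minimal sketch of an equally-weighted summary index: z-score each
# component, then average across components for each respondent.
import statistics

def zscores(xs):
    m, s = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - m) / s for x in xs]

def equal_weight_index(*components):
    """components: per-respondent lists, one list per index component."""
    standardized = [zscores(c) for c in components]
    return [statistics.mean(vals) for vals in zip(*standardized)]
```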
Second, we separate executive function from self control as well. A main reason is that these abilities mature over the lifespan: psychologists and neuroscientists have emphasized the importance of early-stage over late-stage investments because of the neuroscientific principle of developmental plasticity, and data from randomizing young children into different early investments suggest that early, but not later, investments shape cognitive function (Nelson, 2007).
This is not to say that they are not highly correlated or have common roots early in life. A large literature documents that in some extreme populations (e.g., individuals with substance abuse disorder, kids with ADHD) many of these indices of 'self control' co-vary. That is, kids with ADHD have deficits in performance on inhibition tasks (e.g. Barkley, 1997). These same children, by definition, behave impulsively and appear to be more sensation or risk seeking. Taken together, many have taken this covariance as evidence that these traits are interdependent. There is even a small neuro-imaging literature which suggests that these different forms of impulsivity are subserved by the same neural areas (Aron, 2007).
Nonetheless, there are many hints in the psychology and neuroscience literature that this is an oversimplification. For example, even within extreme populations, sensation seeking and impulsivity, measured similarly, may be differentially linked with behavior (Ersche et al., 2010). In typically developing children, successfully resisting temptation on delay of gratification tasks is not predicted by performance on inhibitory control tasks, but by the strategies employed in attempting to resist temptation (Eigsti et al., 2006).
In fact, the best test is to do what we have done here: randomly assign individuals to an intervention which shifts one of these indices and observe whether they all move together. The fact that we see no improvement in executive function is consistent with the skills being different. In Table D.7 we test the combined measures formally, and we do not observe significant increases in a measure combining self control with executive function. Furthermore, their correlation is only 0.15, less than half of the correlation between self control and time preferences.

E Additional treatment effects analysis

E.1 Ignoring the ultimate/intermediary distinction

… are unaffected, but over 12-13 months, the effect of cash plus therapy on antisocial behaviors has a p-value of 0.106. Of course, if we were to adjust for eight comparisons within arms (e.g. if we were testing specific hypotheses about each arm) statistical significance would be greater.

E.2 Robustness of treatment effects to alternate models
Our robustness tests focus on the five main summary outcomes. First, in Table E.2, we show robustness to alternative ways of constructing the indexes and of pooling or averaging endlines. Columns 2-4 report results from the main paper for comparison. Recall that in this main specification we averaged endline surveys (at 2 and 5 weeks, and at 12 and 13 months), took an index of composite measures rather than individual survey questions, and used equal weights. In columns 5-7, we do the same except use randomization inference to assess statistical significance. In columns 8-10, we pool our composite measures from both endline surveys and cluster our standard errors by individual.
In columns 11-13, we do the same except weight each survey question equally. In columns 14-16, we use covariance-weighted indexes from Anderson (2008) and average both endlines. 17 The conclusions from these three specifications are quantitatively similar to those from the main specification. Exceptions are as follows:
• The impact of cash and therapy on the covariance-weighted antisocial behaviors index is not significant after a year at conventional levels. This is because half of this index's weights come from domestic violence and number of arrests, two components that were unaffected by treatment. If we exclude domestic violence from the index and recalculate covariance weights, cash and therapy lead to a .26 standard deviation decline in antisocial behaviors after a year (column 19, significant at the 99% level).
• Cash increases antisocial behaviors after a year in some specifications. In Column 15 we see that, after a year, men who received cash alone report antisocial behaviors 0.17 standard deviations higher. In the other specifications the coefficients are also positive, but smaller and not statistically significant. One possibility is that receiving a cash grant and failing, or having the money stolen, reinforces men's participation in crime. This is largely speculative, however.
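The Anderson (2008) covariance weighting used in columns 14-19 can be illustrated in a few lines. This is a minimal two-component sketch with toy data, not the paper's implementation; it shows how each component's weight is a row sum of the inverse covariance matrix, so components that are highly correlated with the rest receive less weight:

```python
# Illustrative sketch of a covariance-weighted index in the spirit of Anderson (2008).
# Two toy standardized components; all names and data are hypothetical.

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def anderson_weights(a, b):
    """Row sums of the inverse covariance matrix of the (standardized)
    components; components correlated with the others get less weight."""
    s11, s22, s12 = cov(a, a), cov(b, b), cov(a, b)
    det = s11 * s22 - s12 * s12
    inv = [[s22 / det, -s12 / det], [-s12 / det, s11 / det]]
    return [sum(row) for row in inv]

a = [1.0, -1.0, 0.5, -0.5]   # component 1 (mean zero)
b = [0.8, -0.6, -0.4, 0.2]   # component 2 (mean zero)
w = anderson_weights(a, b)
index = [(w[0] * x + w[1] * y) / sum(w) for x, y in zip(a, b)]
```

Because the toy components are mean zero, the resulting index is mean zero as well; in practice one would standardize the index against the control group.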
Next, we check for robustness to alternative attrition scenarios by bounding treatment effects. We impute outcome values for unfound individuals at different points of the observed outcome distribution. The most extreme bound, from Manski (1990), imputes the minimum value for unfound treated members and the maximum for unfound controls. Following Karlan et al. (2015), we also calculate less extreme bounds by imputing relatively high values of the dependent variables for missing control group members and relatively low values for missing treatment group members. 18 Specifically, we impute missing dependent variables for the treatment (control) group as the found treatment (control) mean minus (plus) 0.10, 0.25, or 1 SD of the found treatment (control) distribution. Note these imply large and systematic differences between missing treatment and control members: Columns 8-10 assume unfound control group member outcomes are roughly 2 SD greater than unfound treatment group member outcomes. Table E.3 reports ITT estimates under these attrition scenarios. Our results are generally robust to these alternate specifications. When X = 0.25 SD, we still observe large and statistically significant changes in antisocial behaviors and our index of mechanisms after a few weeks and also after a year. When X = 1 SD, our estimates of treatment effects lose significance but generally point in the correct direction. Meanwhile, the Manski bound brings us closer to having no treatment effects in the medium term.

17 For this index, each component is weighted by the inverse of the covariance matrix of all index components. Outcomes that are highly correlated with each other receive less weight, while outcomes that are uncorrelated receive more weight, as they represent new information. We cannot covariance weight the pooled endlines, since they are unbalanced in the sense that some outcome measures appear in only one endline while others appear in both.

18 This assumes the dependent variable points in the positive direction. If treatment leads to a decrease in the outcome variable, as is the case for antisocial behaviors and antiviolent and anticriminal values, we impute in the opposite direction (i.e., smaller values for control and larger values for treatment).

Notes to Table E.2: The table reports the robustness of our results to alternate index construction and outcome measurement. In columns 2-4, we report results from our main specification, where we average composite measures and do not cluster standard errors. In columns 5-7, we do the same but use randomization inference. In columns 8-10, we pool our endline surveys, weight composite measures equally, and cluster standard errors by individual. In columns 11-13, we pool our endline surveys, weight each survey question equally, and cluster standard errors by individual. In columns 14-16, we weight components using covariance weights from Anderson (2008) and average both endlines. In columns 17-19, we remove domestic violence from our antisocial behaviors index, weight survey questions using covariance weights from Anderson (2008), and average both endlines. *** p<0.01, ** p<0.05, * p<0.1
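The Karlan et al. (2015)-style imputation described above (found-group mean minus or plus X SD) can be sketched as follows; the data and function names are illustrative, and `None` marks an unfound subject:

```python
# Sketch of attrition bounds: impute low values for missing treated subjects
# and high values for missing controls (reversed when treatment is expected
# to reduce the outcome). Toy data; not the paper's implementation.

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

def impute_bound(outcomes, treated, k, positive_direction=True):
    """Fill missing outcomes at found-group mean -/+ k SD of the found-group
    distribution, flipping signs when the outcome points in the negative
    direction (as for antisocial behaviors)."""
    found_t = [y for y, t in zip(outcomes, treated) if t and y is not None]
    found_c = [y for y, t in zip(outcomes, treated) if not t and y is not None]
    sign = 1 if positive_direction else -1
    fill_t = mean(found_t) - sign * k * sd(found_t)
    fill_c = mean(found_c) + sign * k * sd(found_c)
    return [y if y is not None else (fill_t if t else fill_c)
            for y, t in zip(outcomes, treated)]

y = [2.0, None, 1.5, 3.0, None, 2.5]
t = [True, True, True, False, False, False]
y_bounded = impute_bound(y, t, k=0.25)
```

Re-estimating the ITT regression on the imputed data for each k then traces out the bounds reported in Table E.3.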

E.3 Both versus just one treatment
In this section, we compare the effects of receiving one treatment versus receiving both therapy and cash. Specifically, we test whether the coefficients on either therapy only or cash only in Section 6 are statistically different from the coefficients on therapy and cash. Table E.4 displays the mean difference between treatment effects and corresponding p-value for each of our three main outcome variables.
Our results indicate that cash and therapy complement each other in reducing antisocial behaviors in the medium run, while therapy complements cash for the medium-run mechanisms.

E.4 2-5-week versus 12-13-month treatment effects
In discussing our results, we emphasize differences between outcomes 2-5 weeks after the intervention and outcomes 12-13 months after the intervention. In this section, we test whether the 2-5-week and 12-13-month impacts are the same. We pool our short-term results with our longer-term results and run the following OLS regression:

Y_it = β0 + β1 ShortTerm_t + β2 T_i + β3 (ShortTerm_t × T_i) + ε_it,

where ShortTerm is an indicator for outcomes measured in weeks 2 or 5, and T is an indicator for treatment group assignment. In our application, we have three treatment groups (therapy only, cash only, and therapy and cash), include baseline controls and block fixed effects, and cluster our standard errors at the individual level i. The size and direction of β3 determine whether the treatment effects we observe after 2-5 weeks are the same as those observed after a year.

Table E.5 reports these estimates for our three main family indexes. For many outcomes, we cannot reject that β3 is zero. In particular, the short- versus longer-term effects of both therapy and cash are not statistically distinguishable for antisocial behaviors and all mechanisms. However, there are two exceptions worth noting. First, while the cash-only group experienced the largest increase in economic performance 2-5 weeks after the intervention, these effects diminished a year later. Second, while all three treatment groups saw decreases in antisocial behaviors in the short term, the effects of cash alone and therapy alone subsided 12-13 months later.

Notes to Table E.3: The table reports intent-to-treat estimates of each treatment arm under alternative attrition scenarios. In columns 2-10, we impute missing dependent variables for the treatment group as the found treatment mean minus a multiple of the standard deviation of the found treatment distribution, and for the control group as the found control mean plus a multiple of the standard deviation of the found control distribution.
In columns 11-13 we apply Manski bounds, imputing the minimum value for unfound treated members and the maximum for unfound controls. Each regression controls for baseline covariates and neighborhood-phase fixed effects. Each overall summary index is the standardized mean of its composite outcomes. Heteroskedasticity-robust standard errors are reported in brackets. *** p<0.01, ** p<0.05, * p<0.1

Table E.6 reports the incidence of specific crimes reported in the two weeks prior to the 12-13-month survey, breaking down the total number of crimes into the type of crime reported. For consistency, we shift from the incidence of drug selling reported in Table 3 to the frequency: the number of times men reported selling drugs in the past two weeks.
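Under the simplifying assumption of a single treatment indicator and no controls (the actual specification adds baseline controls, block fixed effects, and individual-clustered standard errors), the interaction coefficient β3 in the pooled short- versus longer-term regression above reduces to a difference-in-differences of cell means, which a few lines can illustrate with made-up numbers:

```python
# Toy illustration: in the saturated pooled regression (one treatment
# indicator, no controls), beta_3 equals the short-term treatment effect
# minus the longer-term treatment effect. Data are hypothetical.

def mean(xs):
    return sum(xs) / len(xs)

def beta3(y, treat, short):
    cell = lambda t, s: mean([yi for yi, ti, si in zip(y, treat, short)
                              if ti == t and si == s])
    # (short-term treated minus control) minus (longer-term treated minus control)
    return (cell(1, 1) - cell(0, 1)) - (cell(1, 0) - cell(0, 0))

y     = [0.0, 0.0, -0.5, -0.5, 0.0, 0.0, -0.1, -0.1]
treat = [0, 0, 1, 1, 0, 0, 1, 1]
short = [1, 1, 1, 1, 0, 0, 0, 0]
b3 = beta3(y, treat, short)   # short-term effect -0.5 vs longer-term -0.1
```

A β3 of this kind that is significantly different from zero is evidence that the 2-5-week and 12-13-month impacts differ.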

E.5 Crime: Disaggregated and annualized impacts
Control men committed 2.54 crimes on average in the previous two weeks, and this fell by almost one crime with therapy plus cash. The coefficients are negative and large in proportion to the control mean (20 to 100%) across all types of crime, but the statistically significant (and largest proportional) reductions are in burglary, muggings, and scams (e.g. the sale of non-existent goods, or down payments for a hidden fortune). We do not adjust p-values for multiple hypothesis testing, and so these comparisons across crimes should be taken with caution.
If this decline persisted for the year, it would translate to 26 fewer crimes per person each year. Given the $530 cost of the two interventions, this is roughly $21 per crime, ignoring any other benefits of the program.

Notes to Table E.6: Columns (1) to (4) report the same ITT regression as in Table 3, with robust standard errors in brackets. Columns (5) and (6) simply multiply the two-week estimates by 26 weeks to generate an estimated annual impact per person. *** p<0.01, ** p<0.05, * p<0.1

E.6 Heterogeneity analysis on antisocial behaviors

Table E.7 reports impact heterogeneity from an OLS regression of the antisocial behaviors summary index on the baseline level of either antisocial behaviors or self control and time preferences, treatment indicators, and interactions between treatment and baseline antisocial behaviors or an index of self control and time preferences, controlling for baseline covariates and block fixed effects. (Recall that our measure of antisocial behaviors is a standardized index with mean zero. Therefore, the coefficient on the treatment indicator represents the treatment effect for an individual with the mean level of antisocial behavior at baseline, while the coefficient on the interaction term is the additional effect for individuals whose baseline level of antisocial behaviors was 1 standard deviation higher than average.) We did not prespecify any heterogeneity analysis with antisocial behaviors, and so these estimates must be taken with caution. But these were the only heterogeneity analyses we conducted.
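To fix the interpretation of these interaction terms: with a mean-zero standardized baseline index, the implied treatment effect at baseline level z (in SD units) is the treatment coefficient plus z times the interaction coefficient. The coefficients below are hypothetical, chosen only to mirror the "about double the decline" pattern discussed in the text:

```python
# Reading heterogeneity coefficients from a treatment x baseline interaction.
# beta_t and beta_int are illustrative values, not the paper's estimates.

def implied_effect(beta_t, beta_int, z):
    """Treatment effect for someone z SDs above the baseline mean."""
    return beta_t + beta_int * z

beta_t, beta_int = -0.25, -0.25                        # hypothetical
effect_mean = implied_effect(beta_t, beta_int, 0.0)    # average participant
effect_high = implied_effect(beta_t, beta_int, 1.0)    # 1 SD more antisocial
```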
Therapy decreased the incidence of antisocial behaviors for the average participant, but men exhibiting more antisocial behavior at baseline saw larger declines. For example, men with average levels of antisocial behaviors at baseline who were assigned to both therapy and cash experienced a 0.25 standard deviation decline in their level of antisocial behaviors 12-13 months later, but men whose initial level of antisocial behaviors was a standard deviation higher than average experienced about double the decline. Our results also indicate that after a year, men with high levels of initial antisocial behavior who received a cash grant actually increased their antisocial acts. This is especially interesting given that the effects of cash on occupational choice and income disappeared after a year. One possibility is that this increase in antisocial behavior is a reaction to the failed attempt at legitimate livelihoods, but these results are more speculative than anything else.
Our results also indicate that therapy and cash decreased the incidence of antisocial behaviors by 0.25 SD for participants with average self control and time preferences, but the effects were smaller for men who were more patient at baseline. These conclusions remain when we adjust for two comparisons within the "both" treatment arm. 19

E.7 Program impacts on occupational choice
To measure changes in occupational choice, we asked respondents at each endline whether they had engaged in 22 occupations, from farming to petty business, trades, and formal jobs. For each occupation, we collected self-reported earnings and hours in both the last week and the week prior. We use these to calculate the total earnings and hours variables. With two endline surveys, we have four weeks of employment data per person in both the 2-5-week and 12-13-month surveys.
We can also calculate hours by occupation each week, aggregating our 22 occupations into 5 mutually exclusive categories:

1. Non-agricultural high-skill work, which includes trading and office work

2. Non-agricultural low-skill business, which includes selling from a shop, selling at a table, buying and selling, engaging in petty trade, and conducting small business

3. Non-agricultural low-skill wage labor, which includes contract work, carloading, car-washing, peim-peim riding, carrying loads, guarding, housecleaning, and construction

4. Agricultural work, which includes farming and fishing

5. Illicit work, which includes selling drugs, stealing, gambling, gold rubber, and scavenging

Treated men shift from illicit work to non-agricultural low-skill business. Those assigned to both therapy and cash experience the largest decline in illicit work. Time spent in illicit work falls 38% 2-5 weeks after implementation relative to the control group, and is 17% less than the control group one year later (although the latter is not statistically significant). Although the cash-only group more than doubles its weekly hours spent in non-agricultural low-skill business in the short term, these effects phase out 12-13 months later.
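The aggregation from detailed occupations into the five categories amounts to summing weekly hours under a many-to-one mapping. The occupation strings and partial mapping below are illustrative, not the survey's exact codes:

```python
# Sketch of aggregating weekly hours by occupation into the five categories
# listed above. The mapping is partial and the labels are hypothetical.

CATEGORY = {
    "trading": "high-skill", "office work": "high-skill",
    "shop sales": "low-skill business", "petty trade": "low-skill business",
    "car-washing": "low-skill wage", "construction": "low-skill wage",
    "farming": "agriculture", "fishing": "agriculture",
    "selling drugs": "illicit", "stealing": "illicit",
}

def hours_by_category(hours):
    """hours: dict mapping occupation -> weekly hours; returns category totals."""
    totals = {}
    for occ, h in hours.items():
        cat = CATEGORY.get(occ, "other")
        totals[cat] = totals.get(cat, 0.0) + h
    return totals

week = {"petty trade": 10, "selling drugs": 4, "farming": 6, "stealing": 2}
totals = hours_by_category(week)
```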

Variable selection
We selected six variables for validation, all with recall periods of two weeks. We chose outcomes with varying degrees of salience (or memorability), potential social stigma, and experimenter bias. We wanted very specific behaviors (e.g. stealing rather than any crime, or marijuana rather than substance abuse). Finally, we wanted sensitive outcomes that were a primary focus of the treatment (stealing) and others that were less so (gambling or expenditures). The variables we selected in the end were:

1. Stealing. The survey asked how many times in the last two weeks the respondent stole someone's belongings or deceived or conned someone of money or goods. 20 Based on our fieldwork, we hypothesized that stealing would be the most salient and least socially desirable of all six measures.
2. Gambling. The survey asked how many times in the last two weeks the respondent gambled or bet on sports. Beforehand, we hypothesized gambling had a lower level of salience and sensitivity than stealing, but was still somewhat stigmatized.
3. Marijuana use. The survey asked how many times in the last two weeks the respondent smoked marijuana. Marijuana use is not socially acceptable across Liberian society overall, but is fairly prevalent in our target demographic. We initially hypothesized underreporting could arise not so much from social stigma but from the discouragement of drug use in the therapy treatment.
4. Homelessness. The survey asked how many times in the last two weeks the respondent had to sleep outside, on the street, or in a market stall because they had no other place to sleep or stay. This is a salient variable where we hypothesized respondents might have under-reported from embarrassment or over-reported in order to appear more needy (and eligible for more programs).
5. Phone charging. In the expenditure section of the survey, the survey asked how many times in the last two weeks the respondent charged his phone for money. This corresponds to taking one's phone to a kiosk with electricity where one pays a small fee to recharge the battery, a common and routine expense for many Liberians, without stigma and possibly not very memorable. 38% of our sample had a mobile phone at the endline, and 38% reported charging a phone in the last two weeks.
6. Video Club Attendance. In the expenditure section of the survey, the survey asked how many times in the last two weeks the respondent went to a video club. These clubs are private businesses where one can go to watch a movie, television show, or football match for a small fee. This is a popular and socially acceptable pastime, as most Liberians do not have electricity or home entertainment. Salience was unclear but likely greater than phone charging.
In part, our use of many non-primary outcomes was deliberate. But, to be frank, our choices were driven more by the practicalities of validation, and in retrospect it would have been useful to focus on more primary outcomes.

Validator staff
Eight local staff performed validations over the two years of data collection. We selected validators from the study's qualitative research staff. These people typically began as survey enumerators, but displayed such skill and rapport with the subjects that we hired and trained them to conduct a separate qualitative research component: longitudinal, formal, open-ended interviews with a different subsample of subjects. All conducted the qualitative validation when they were not working on the formal open-ended interviews. 21 Each validator received at least 10 days of training on the methods, including both classroom learning and extensive field training. We trained more qualitative researchers than were needed for the exercise. Those who exhibited superior performance during the trainings were selected as validators. The aim of the training was to develop and refine trainees' skills in acquiring informed consent, building rapport with respondents, collecting and recording data, and analytical reasoning. Trainings were held for eight hours each day and, over the course of 10 days, transitioned gradually from exclusive classroom learning to field trainings with short debriefing sessions. Field trainings provided trainees with opportunities to practice the skills and techniques they had learned.
Like any qualitative study, we believe staff recruitment and training to have been among the most important tasks and also the largest start-up cost of this method.

Approach
For each respondent, validators tried to determine whether the respondent had engaged in any of the measured behaviors, even once, in the two weeks preceding the respondent's survey date, the same recall window the survey questions covered. We found it optimal for validators to visit each respondent four times, on four separate days, with each visit or "hangout session" lasting approximately three hours. The validator aimed to begin hanging out the day after subjects completed their quantitative surveys and to conduct all four visits in the days following the respondent's endline survey date.
Validators deliberately avoided the feeling of a formal interview and would typically accompany respondents as they went about their business. 22 Validators sometimes took notes during visits, but only in isolated areas out of sight from the respondent. 23 The idea follows from basic principles of ethnography, which seeks to study subjects in their natural settings, similar to those the researcher hopes to generalize about. The intent is to reduce the sense of being in an experimental situation, which ethnographers perceive as creating bias.
The main approach was to engage in casual conversation on a wide range of topics, including the six target topics/measures. The target topics were raised mainly through indirect questions while informally chatting. For example, validators typically started conversations with discussions of family. This was both customary among peers in Liberia and a sign of respect and interest in respondents' lives. It was also a stepping stone for discussing the target behaviors-either because the validator can discuss an issue in their family (someone engaging in one of the activities) or how the respondent's family feels about their current lifestyle and circumstances.
In general, validators found it helpful to tell respondents stories or scenarios about another person or themselves, related to the target measures, then steer the conversation to get information about how respondents had behaved in similar situations, eventually discussing the past two weeks. Validators were careful to present these behaviors and incidents in a non-stigmatized light, for instance by discussing a friend who stole in order to get enough to eat, or how they themselves had periods of homelessness or used drugs and alcohol. Validators found these personal stories (all of which were truthful) and genuineness were essential to building rapport and trust.
22 On the first visit validators would obtain verbal consent. We designed the consent script to be informal, and explained that the goal of hanging out with the respondent was to talk about some of the same things they discussed in the survey. In addition to this verbal consent, the formal consent form that preceded the recent survey said that qualitative staff may come and visit them again to gather more information.
23 e.g. in a toilet stall or teashop. If validators were unable to find a secluded area in which to take notes, they sometimes recorded information in their cell phones, pretending to send a text message.
Validators might hold these conversations once or twice over the three hours, spending perhaps twenty or thirty minutes in conversation each time, to avoid unnaturally long or awkward conversations. The validator spent the remainder of the three hours in the general vicinity, observing respondents engaging in their daily activities. This could involve taking a rest in the shade or in a tea shop (as is common) or engaging others in conversation. Validators would also try to talk casually with the respondent's friends, relatives, or neighbors to learn about him, although we treated information from these second-hand sources as supporting information only, insufficient on its own to support a conclusion about the respondent's behaviors.
We found that building a rapport with participants in a short space of time was crucial. To develop trusting and open relationships, validators used a range of techniques, including becoming close to respected local community and street leaders, eating meals together, sharing personal information about themselves, assisting subjects with daily activities, and mirroring participants' appearances and vernacular, as appropriate. In addition, validators tried to maintain neutrality and openness while discussing potentially sensitive topics. For instance, conveying, through stories or otherwise, that illicit behaviors were not perceived negatively allowed respondents to feel comfortable sharing their involvement in such activities. Validators did not lie to or deceive respondents, however.
Overall, this approach (trust-building, spending time together over the course of several days, assuming the role of an "insider," attempting to obtain admission or discussion of the behavior, clandestine but fairly immediate note-taking, and, as discussed below, close examination of the evidence for each respondent with the investigators) was designed to counter the observer bias and selective recall that concern participant observation. 24 Developing a rapport with respondents, spending time to develop a relationship, and obtaining insider status are considered central to obtaining more honest and valid responses (Baruch, 1981; Bryman, 2003; Fox, 2004). We are not aware of any study, however, that has quantitatively tested this proposition.

Validation sampling and non-response
In each endline survey round we randomly selected study respondents to be validated, stratified by treatment group. 25 Table F.1 describes the samples selected for validation in each survey round over the course of the study. In total, we randomly selected 7.4% of all surveys, 297 in total, for validation.
We found 240 (81%) of the 297. 26 This attrition is an identification concern, but there is little evidence of biased attrition. Excess validation attrition (those who were surveyed but not validated) was not robustly associated with baseline characteristics (see Appendix A.3).
24 For general discussions of validity in qualitative methods, see Wilson (1977); LeCompte and Goetz (1982); Power (1989).

25 For each pair of survey rounds, study participants were randomly divided into blocks (e.g. 1, 2, 3, 4), and block 1 study participants were surveyed before block 2, and block 2 before block 3, etc. Within each block we randomly selected validation subjects using a computer-generated uniform random variable. The selection was performed without replacement in a given pair of survey rounds (e.g. the short-term endline surveys in a given phase), but sampling was performed with replacement across survey rounds. Twenty subjects were validated in more than one round.

26 We could not find 15 for even the endline survey. We could not validate a further 42 because they were difficult to find even immediately after the survey or (more commonly) because they lived a long distance away. In general, we surveyed respondents who had moved far out of Monrovia, but we were unlikely to validate them because of the time, expense, and opportunity cost.

Notes to Table F.1: The proportion selected in each round was principally a function of logistical feasibility (e.g. the number of available staff), and in some rounds none were selected. As procedures became more familiar and staff more experienced, more could be done over time. The percentage validated in the treatment group includes any treatment (cash, CBT, or both).
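A minimal sketch of the within-block selection described in footnote 25, assuming hypothetical block and subject identifiers: subjects in each block are ranked by a uniform draw and the first k are taken without replacement, with the procedure repeated independently each survey round (so a subject can be selected in more than one round).

```python
# Illustrative within-block random selection via uniform draws.
# Block and subject identifiers are hypothetical.
import random

def select_for_validation(blocks, k, seed=0):
    """blocks: dict block_id -> list of subject ids; returns selected ids,
    k per block, chosen by ranking on a computer-generated uniform draw."""
    rng = random.Random(seed)
    selected = []
    for _, subjects in sorted(blocks.items()):
        ranked = sorted(subjects, key=lambda s: rng.random())
        selected.extend(ranked[:k])   # without replacement within this round
    return selected

blocks = {1: ["a", "b", "c", "d"], 2: ["e", "f", "g", "h"]}
picked = select_for_validation(blocks, k=1)
```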

Statistical power
In order to minimize the confidence intervals surrounding any treatment-measurement error correlation, we chose the sample size that maximized the number of interviews we felt qualified validators could manage logistically. 27 Post hoc calculations of statistical power confirm the estimates we made at the design stage. With a sample of 240, we can detect general over- or under-reporting greater than 17% of the survey mean (14% of the "true" validated mean). 28 Because each treatment arm is a subsample, however, we cannot precisely measure the effect of treatment on misreporting: we can only detect effects greater than 33% of the survey mean (28% of the validated mean). Thus we are principally interested in the sign and magnitude of the treatment effect on misreporting by treatment group.

Coding validated data
Validators were unaware of the respondents' survey responses, and formed their own opinions (based on the evidence collected) about whether respondents engaged in the six activities during the time period captured by the quantitative survey. Every coding recommendation was then discussed with and vetted by one of the authors.
A core part of the validator training included logical reasoning, supporting reasoning with evidence, and writing this down in a clear and structured manner. After each visit, validators made written notes about the relevant data collected, including evidence to support their conclusions, on a standardized form. At the conclusion of the four visits, the validator coded six indicators, one for each behavior, where "1" meant that he had relatively direct evidence that the respondent engaged in the behavior during the recall period, and "0" otherwise. 29

27 In general, the validation sample was a balanced subsample of the full sample. Power calculations, based on roughly the first 60 validator interviews, indicated that there was a modest degree of underreporting of all behaviors, sensitive and non-sensitive, but that the correlation between treatment status and measurement error was uncertain: across outcomes it varied in sign and magnitude, but was about zero on average. Thus the chief advantage of maximizing the sample conditional on time available was to shrink the confidence interval to build confidence in our method and the main outcomes of interest. Further validation was mainly limited by the number of validators we felt could be trained and supervised.

28 We calculated this minimum detectable effect (MDE) using a two-sided hypothesis test with 80% power at a 0.05 significance level, using baseline and block controls when calculating the R-squared statistic. We calculated an MDE for both the 0-2 expenditures index and the 0-4 sensitive behaviors index. The expenditures index had a mean of .82 in the survey and an MDE of .13 for general over- and under-reporting and .29 for a treatment effect on misreporting. The sensitive behaviors index had a mean of 1.12 in the survey and an MDE of .2 for general over- and under-reporting and .36 for any treatment effect on misreporting. We estimate that doubling the sample size would have increased power by about a third.
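The MDE calculation in footnote 28 follows the standard formula MDE ≈ (z for the significance level plus z for power) times the standard error. A quick sketch, with an assumed residual SD (the actual calculation used baseline and block controls via the regression R-squared):

```python
# Back-of-the-envelope MDE for a two-sided test with 80% power at alpha = 0.05.
# The residual SD and n below are illustrative assumptions.

Z_ALPHA, Z_POWER = 1.96, 0.84   # standard normal critical values

def mde(sd_residual, n):
    se = sd_residual / n ** 0.5
    return (Z_ALPHA + Z_POWER) * se

# e.g. n = 240 validated subjects, assumed residual SD of 0.75 index points
effect = mde(0.75, 240)
```

With these assumed inputs the MDE lands near the .13 figure reported for the expenditures index, though the paper's exact numbers depend on the residual variance after controls.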
29 Over the course of the exercise, different measures offered different experiences and lessons. Because of its relative frequency and visibility, we suspect marijuana use was the easiest to directly observe. But validators found other behaviors straightforward to discuss in conversation. In the survey and (especially) the validation, phone battery charging led to the most confusion: in particular, did simply charging one's phone count, or did only paying to charge one's phone count? Paid charging was the focus of the survey question (it appeared in an expenditure survey module), but we were concerned that the validators would use a more expansive definition. We attempted to mitigate such differences through trainings and regular discussions on the coding.
Homelessness also proved somewhat challenging to measure and validate, as we discovered its definition is subjective. Circumstances arose that were somewhat ambiguous, such as having no home of one's own but regularly sleeping on a friend's floor or in an acquaintance's market stall. To account for the potential variability in perceptions of homelessness, validators were instructed to include as much information as possible about respondents' living situations in their summary reports. The authors then worked with validators to code a somewhat broad definition of homelessness that included any ambiguous circumstances. Prior to analysis, it was not clear whether survey respondents applied the same definition, and hence we err on the side of finding underreporting in the survey.
Validators recorded an average of 1.35 "major" pieces of evidence per respondent per behavior to support their coding decisions. This was typically the most persuasive piece or pieces of evidence rather than all evidence collected. 30 Table F.2 reports evidentiary methods by behavior. In general, the validators used some form of direct or indirect questioning: a direct admission of the behavior or persuasive statements that the respondent did not engage in it. The validators witnessed or found direct evidence of the behavior in only a fifth of cases, and had third party verification in about 6% of cases. In any event, witnessing or third party verification alone was not sufficient evidence for a final coding. For instance, witnessing had to be followed by questions confirming that the respondent also engaged in the behavior in the two weeks prior to the survey. This accounts for most of the cases where there was more than one piece of evidence highlighted.
In general, the patterns of evidence are fairly commonsensical. Witnessing is limited to observable behaviors such as marijuana, gambling, homelessness, and phone charging. Stories and scenarios where the respondent is invited to comment or discuss are especially common for the most sensitive subject, stealing. Indirect questioning is most common for everyday topics such as homelessness ("Is this your house?") and phone charging ("I need to charge my phone. Where do you usually charge yours?").

Limitations of the approach
While we think, based on our experiences, that this validation exercise gave enough time to gather detailed, accurate information and fostered trust and frankness, there are nonetheless limitations to this approach.
1. Potential disruption. The presence, and interactions and conversations with the validators may be intrusive and might disrupt respondents' daily activities, thereby altering the findings. To mitigate this risk, validators wore clothes that would blend in with their respondent's environment, and typically accompanied and assisted respondents in their activities as appropriate (e.g. helping a scrap metal collector scavenge).
2. Differences in recall periods. The validation occurred after the time period about which the survey questions had asked, and validators or respondents could have made errors about the relevant window of time (e.g. homelessness could have been observed the week after the survey and incorrectly attributed to the time of the survey). This is most likely a source of random measurement error.
3. Inconsistent questions. The survey and validation questions might have been interpreted differently, making it difficult to compare results. As discussed above, phone charging and homelessness proved somewhat difficult to measure consistently. We used close consultations, reviews of the data, and focus groups with survey and validation staff to maximize consistency.
Notes to Table F.2: Direct questions imply the validator asked the respondent directly about his engagement in the activity. Indirect questions imply the validator brought up the subject in general conversation ("Where do you live?" "What do you do to make money?"). Stories and scenarios are a form of indirect questioning where the respondent is invited to comment. Witnessing or found evidence implies the validator saw the respondent engaging in the activity in question or found physical evidence that the respondent recently engaged in the activity. Third-party accounts imply the validator asked the family and friends of the respondent whether or not he engaged in the activity. Other or unclear methods include a handful of cases of unprompted information from the respondent, and also cases where the behavior could be inferred from other knowledge; mainly this implies that the coding was inconclusive or incomplete but was likely a form of questioning.
4. Coding errors. Validators could have coded a behavior mistakenly, but coding generally required extended discussion and (usually) a direct admission of the behavior. Also, one of the authors reviewed and discussed the evidence for every subject with the validator.

5. Increasing social desirability bias. In principle the participant observation method, by building rapport, could lead to a different source of measurement error, for example by increasing social desirability bias. Our strong sense is that the opposite is true: trust and rapport reduced the bias. But this is a subjective interpretation and not independently verifiable.
6. Consistency bias. In principle, respondents could recall their survey response and try to remain consistent despite trust-building. This could motivate randomizing the order of validation and survey in the future.
7. Non-blinded validators. The researcher is not immune from bias in qualitative research (LeCompte and Goetz, 1982; LeCompte, 1987). We are especially concerned with any bias correlated with treatment. While validators were not told the subject's treatment status, it is possible, and even likely, that treatment status came up during the extended conversations. Thus there is a danger that the validators' biases are correlated with treatment. The trust-building and the preference for a direct admission of the behavior were intended to mitigate this risk, but some risk remains.
Most importantly, it seems unlikely that validators would commit most of these errors differentially across study arms. Misreporting correlated with treatment remains a risk under the consistency-bias and non-blinded limitations, but the in-depth focus on a handful of questions, the time invested, and the trust-building were designed to counteract these biases as much as possible. If so, the qualitative validation method may be most useful for building confidence in estimated treatment effects.
Finally, like any qualitative work, this is not an off-the-shelf tool. To select and refine the variables, recruit and train validators, and monitor quality of the data requires the researcher to have some familiarity with the context and population and at least basic experience in qualitative data collection.

Replicability of the approach
There are three reasons to think that this method could be replicated in other developing country field experiments and observational analyses using surveys. First, the expertise needed to implement the method effectively exists in most countries. Indeed, it should be considerably simpler to implement outside Liberia than inside it. After fourteen years of civil war, and with one of the lowest human development indices in the world, Liberia has very low local research capacity, even compared to other poor and post-conflict states.
Second, most social scientists are nearly as well prepared to design and implement the approach as they are to design a new survey instrument or measure. Like any measure or method, it takes local knowledge, care, and extensive pretesting to develop a credible approach, and it can benefit from someone with expertise in the subject area. In our case, one of the field research managers had some background in qualitative work and quality assurance, which we believe improved the quality of the training and the selection of the validator staff.
Third, the cost of the data collection is not necessarily large relative to many field experiments or large-scale panel surveys. In this instance, the fixed cost of startup was primarily in the recruitment and training of the small number of validators, approximately 2 to 3 weeks of work. We estimate the marginal cost of validation was roughly $80 per respondent, mainly in wages and transport. By comparison, the marginal cost of surveying a respondent was roughly $70. 31 While this method is considerably more expensive than survey experiments, it is more in line with the depth and cost of commonplace efforts to improve consumption measurement through the use of diaries or physical measurement. 32 For crucial measures in large program evaluations, or for statistics informing major policies, the cost is small relative to the intervention, larger study, or larger purpose. For instance, as a proportion of total expenditures on the study, this validation exercise cost under 3% of all research-related costs, and less than 1-2% of program plus research costs.

F.2 Further analysis
Misreporting levels
Table F.3 reports our proxy of survey over-reporting: the simple survey-validation differences, with p-values from a t-test of the difference from zero. Negative values indicate survey under-reporting, assuming of course that the validator measure is the more accurate one. As noted above, we have the statistical power to detect differences greater than about 17% of the survey mean.
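This over-reporting proxy can be sketched in a few lines of code. The sketch below is ours, not the study's code: the function name and toy data are hypothetical, and a normal approximation stands in for the exact t distribution.

```python
# Illustrative sketch of the survey-validation difference test.
from statistics import mean, stdev
from math import sqrt, erf

def overreporting_test(survey, validated):
    """Mean survey-minus-validation difference and a test of H0: difference = 0."""
    diffs = [s - v for s, v in zip(survey, validated)]
    n = len(diffs)
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(n)
    t = d_bar / se
    # Normal approximation to the t distribution (adequate for moderate n).
    p = 2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2))))
    return d_bar, t, p

# Toy data: 1 = behavior reported, 0 = not reported.
survey    = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
validated = [1, 1, 0, 1, 0, 1, 0, 1, 1, 0]
d_bar, t, p = overreporting_test(survey, validated)
# A negative d_bar indicates survey under-reporting relative to validation.
```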
Overall, gambling seems to be slightly underreported in every treatment arm, and highly underreported by men in the control and cash only groups. For instance, 33% of the cash only group admitted to gambling during validation, compared to 13% during the survey. Some of this underreporting could be due to ambiguous behaviors being coded as gambling in validation interviews but not in the survey. But the fact that underreporting is smaller in the therapy arms suggests that the underreporting is not an artifact of different definitions, but rather reflects a strategic response to treatment status.
If we look at stealing, marijuana use, and homelessness, however, none of the survey-validation differences are statistically significant. There is possibly some slight underreporting of drug use and slight over-reporting of stealing, but the magnitudes are generally small in the sense that they are less than 10% of the survey means reported in Table 9. The sample size is small, however, and so many of these differences are not precisely estimated.
We see much stronger evidence of underreporting of expenditures in the survey. The difference for both expenditures is -0.27 in the full sample (Table F.3, Column 6). This difference is large, about a third of the survey mean reported in Table 9. Expenditure underreporting is largest for the video club measure, but both expenditures appear to be underreported. Interestingly, the mean differences appear to be smaller and less statistically significant if the men received one of the treatments. We return to these differences across treatment arms below.

Patterns of survey under- and over-reporting
In our validation exercise, there may be cases where the validation technique did not report a behavior that was reported in the survey. Table F.4 reports the number of cases where the survey and validation measures do not agree, divided into cases of survey over- and under-reporting relative to the validation measure. Over-reporting is driven by stealing, gambling, homelessness, and going to the video store. Over-reporting is limited for marijuana use and phone charging, which are some of the least ambiguous and most habitual activities.
31 Both figures were driven by the fact that it typically took one to two days of searching to find each respondent for surveying, plus the time to survey itself. Both surveying and validating in Liberia were expensive by the standards of household surveys, largely because of the cost of operating in a fragile, post-conflict state and the great difficulties in tracking such an unstable population.
32 In one extreme example, in the India NSS consumption survey, enumerators physically measure the volume of all food consumption Group (2003).
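The tabulation behind this kind of table is simple to sketch. The following is a hypothetical illustration (our function name and toy data, not the study's code) of splitting survey-validation disagreements for a binary behavior into over- and under-reporting.

```python
# Split survey-validation disagreements into over- and under-reporting.
def disagreement_counts(survey, validated):
    over = sum(1 for s, v in zip(survey, validated) if s == 1 and v == 0)
    under = sum(1 for s, v in zip(survey, validated) if s == 0 and v == 1)
    agree = sum(1 for s, v in zip(survey, validated) if s == v)
    return {"over_reported": over, "under_reported": under, "agree": agree}

# Toy data: 1 = behavior reported, 0 = not reported.
counts = disagreement_counts([1, 0, 1, 0, 1], [0, 0, 1, 1, 1])
# → {'over_reported': 1, 'under_reported': 1, 'agree': 3}
```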
Another way to understand this point is to rerun equation 3 in the paper but omit block fixed effects and restrict $\beta_1 = 0$ and $\beta_3 = 0$:
$$y^{s}_{i} = \beta_0 + \beta_1 T_i + \beta_2 y^{v}_{i} + \beta_3 \left(y^{v}_{i} \times T_i\right) + \mu_i.$$
In this case, $\beta_0$ is an estimate of survey over-reporting. Table F.5 reports these results: Panel (a) with the restrictions $\beta_1 = 0$ and $\beta_3 = 0$, and Panel (b) without them, for comparison. Looking at the sensitive behaviors in Panel (a), we see evidence of survey over-reporting ranging from roughly 12 to 15%. Moreover, $\beta_0$ and $\beta_2$ are relatively similar in Panels (a) and (b), suggesting that treatment has little effect on this survey over-reporting.
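As a concrete illustration, the regression above can be estimated by OLS on the normal equations. This is our sketch with hypothetical toy data, not the paper's estimation code; the `ols` helper and all variable names are ours.

```python
# Minimal OLS via the normal equations X'X b = X'y (Gaussian elimination).
def ols(y, X):
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for i in range(k):                      # forward elimination, partial pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p], b[i], b[p] = A[p], A[i], b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * c for a, c in zip(A[r], A[i])]
            b[r] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):            # back substitution
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

# Toy data: y_s = survey report, y_v = validator report, T = treatment dummy.
y_s = [1, 0, 1, 1, 0, 0, 1, 0]
y_v = [1, 0, 0, 1, 0, 1, 1, 0]
T   = [0, 0, 0, 0, 1, 1, 1, 1]
X_unrestricted = [[1, t, v, v * t] for t, v in zip(T, y_v)]   # full specification
X_restricted   = [[1, v] for v in y_v]       # imposes beta1 = beta3 = 0
beta   = ols(y_s, X_restricted)              # beta[0] ~ survey over-reporting
beta_u = ols(y_s, X_unrestricted)            # Panel (b) analogue
```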
We do not know for certain why up to 15% of people would report a sensitive behavior in the survey but not in the validation exercise, but there are several plausible explanations. First, survey respondents may not have considered the "last two weeks" recall period carefully, and reported behavior over a wider range. Validators were trained to be more strict with the recall window. Second, although we tried our best to maintain consistent definitions across the survey and the validation exercise, validators might have used more restrictive definitions of the behavior in question. Finally, validators may simply have been more conservative in their coding of these behaviors, or set too high a bar for certainty.
The key, however, is that there is no evidence that misreporting is associated with treatment status, which is itself the core finding from the general analysis of the validation exercise.

Treatment effects in the overall sample versus the validation sample
In this section we investigate whether the treatment effects observed in the validation sample are similar to those observed in the full sample. Panel (a) of Table F.6 takes the survey measures of our six validated outcomes and reports ITT estimates in both the validated sample (N=238) and the full sample. Panel (b) takes the validator measures of our six outcomes and reports ITT estimates in the validated sample (N=238). Although the validation sample has only 238 observations, and so standard errors are large, the estimated treatment effects are qualitatively similar across all three sets of regressions.

F.3 Adjusted treatment effects
We estimate the effect of each treatment on survey over-reporting in Table F.7. These estimates effectively take the simple survey-validation differences in Panel A of Table 10 and estimate the difference across treatment arms, adjusting for baseline covariates as well as block fixed effects. We use these to calculate an adjusted treatment effect.
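The adjustment logic can be sketched as follows. This is a stripped-down illustration with toy data: it uses simple group means and omits the baseline covariates and block fixed effects that the paper's version includes, and all names are hypothetical.

```python
# Adjusted ITT = survey-based ITT minus the treatment effect on misreporting
# (the survey-minus-validation difference).
from statistics import mean

def group_means(y, T):
    """Difference in means between treatment (T=1) and control (T=0)."""
    treat = [yi for yi, t in zip(y, T) if t == 1]
    ctrl = [yi for yi, t in zip(y, T) if t == 0]
    return mean(treat) - mean(ctrl)

# Toy data: survey and validator measures of the same outcome.
y_survey = [0.2, 0.4, 0.1, 0.3, 0.6, 0.8, 0.7, 0.5]
y_valid  = [0.4, 0.5, 0.3, 0.4, 0.6, 0.9, 0.8, 0.5]
T        = [0, 0, 0, 0, 1, 1, 1, 1]

itt_survey = group_means(y_survey, T)                 # ITT in survey data
misreport  = group_means([s - v for s, v in zip(y_survey, y_valid)], T)
itt_adjusted = itt_survey - misreport                 # equals the validator ITT here
```

Without covariates, the adjusted effect reduces algebraically to the ITT on the validator measure; the regression version simply makes the same subtraction while conditioning on baseline covariates and blocks.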
First, the results imply that the adjusted treatment effect of therapy and cash on sensitive behaviors overall is no lower than what we estimate with self-reported survey data, and may even be larger (Column 1). This holds true for each of the individual sensitive behaviors, save marijuana use. Despite the large standard errors introduced by the small validation sample, the adjusted treatment effect on all sensitive behaviors is larger and significant at the 1% level.
Meanwhile, the underreporting of gambling does not have a statistically significant association with treatment. Those who received cash alone underreported gambling to the surveyors more often than control group members, and so the measurement error in gambling is probably a combination of a general desirability bias as well as one correlated with treatments. A larger sample size would be needed to separate these more precisely.
In contrast, the slight underreporting of expenditure behaviors in the survey (seen in Table F.3 above) implies that the short-term increase in survey-based expenditures due to cash could be due to measurement error correlated with treatment. The adjusted treatment effect of therapy plus cash is generally negative but not statistically significant (Column 6). We see a similar pattern with another expenditure-related item, homelessness, in Table F.7: the survey-reported decline in homelessness tends to disappear with adjustment.