PROTOCOL: Body‐worn cameras’ effects on police officers and citizen behavior: A systematic review

Body‐worn cameras (BWCs) are one of the most rapidly diffusing technologies in policing today, costing agencies and their municipalities millions of dollars. This adoption has been propelled by highly publicized events involving police use of force or misconduct, often linked to concerns of racial and ethnic discrimination (see general discussions by Braga, Sousa, Coldren, & Rodriguez, 2018; Lum, Stoltz, Koper, & Scherer, 2019; Maskaly, Donner, Jennings, Ariel, & Sutherland, 2017; Nowacki & Willits, 2018; White, 2014). In culmination, these contexts fostered enough public and political will to generate an urgent call for BWCs. This demand was matched with a prepared supplier; technology companies had already been developing both BWCs and other similar surveillance devices (e.g., in‐car cameras, license plate readers, and closed‐circuit televisions). In the United States, an estimated 60% of local police departments have fully deployed BWCs (Hyland, 2018). Similar widespread testing, piloting, and adoption of BWCs have also occurred in the United Kingdom, Australia, and Europe. Given the rapid and widespread adoption of BWCs, their significant costs, and their potential impacts on law enforcement agencies and the communities they serve, an important question for practitioners, government officials, and researchers is whether the cameras effectively achieve the expectations of them. In their narrative review of empirical BWC research, Lum et al. (2019) suggest there may be equivocal answers to the question of BWC effects. This systematic review of BWCs reviews and synthesizes existing research to examine these concerns.

1.2 | The intervention and how it might work

may also similarly serve to deter individuals that officers encounter.
For example, civilians may see the BWCs (or be alerted to them verbally by officers) and then moderate their behavior accordingly.
Existing research both supports and challenges the deterrence and self-awareness hypothesis of BWCs. As Lum et al.'s (2019) narrative review has found, early research seemed to show that BWCs reduced the use of force by officers. More recent findings, however, have been mixed.
The hypothesized self-awareness imposed on officers by the cameras may also affect their use of arrests, citations, and proactive activities. An additional challenge to understanding BWC effectiveness is that survey research has shown there is an incongruence in the expectations that police and community members have for BWCs (Lum et al., 2019).
The meaning of "effectiveness" for the police may not be the same as what "effectiveness" means for civilians. The police may view cameras as effective when they protect officers from frivolous complaints and assaults and when they strengthen officers' ability to arrest and prosecute offenders. Civilians, in contrast, may judge BWC effectiveness by whether cameras provide greater accountability and transparency for officer actions and protect the public against excessive use of force and officer misconduct. And, while some of these effects can be explained from a deterrence perspective, other effects are purely technical or organizational. For example, BWCs are also believed to improve investigations and case clearances. Here, the theoretical mechanism is straightforward: if a crime or an important piece of evidence is captured on an officer's BWC, it can be used to more effectively prosecute offenders. Organizational effects may include the effects of BWCs on training, which delves into the realms of educational theory (i.e., visual aids for learning may create better retention of experiential knowledge for application in the future).

| Why it is important to do this review
Because the rapid adoption of BWCs was driven by public protest, law enforcement concerns, government funding, and the development and marketing of portable video technology, it should not be any surprise that BWCs were quickly adopted in a low-research environment (Lum, Koper, Merola, Scherer, & Reioux, 2015). The importance of scientific inquiry about police technologies like BWCs, however, cannot be overstated. If law enforcement, and ultimately citizens, intend to invest heavily in BWCs, then BWCs should produce the outcomes we expect of them.
Unfortunately, however, researchers have consistently found that police technologies may not lead to the outcomes sought and often have unintended consequences for police officers, their organizations, and citizens (Chan, Brereton, Legosz, & Doran, 2001; Colton, 1980
In their narrative review, Lum et al. (2019) concluded that although it appears that many agencies and officers support BWCs, BWCs have not consistently had the effects intended by either police officers or community members. They argue that anticipated effects may have been "overestimated" and that behavioral changes in the field may be "modest and mixed." They also note that while complaints have declined in many evaluations of BWCs, it is unclear why the decline occurs and whether the actual interactions or relationships between the police and the public have improved. Some outcomes have not been investigated, in particular the impact of BWCs on racial and ethnic disparities in policing outcomes, the alleviation of which was a major reason for some communities to push for BWC adoption. At the same time, Lum et al. state that BWCs will continue to be adopted by police agencies, which makes the production and synthesis of rigorous research even more essential to this policy area. what we can conclude about the impacts of BWCs from the existing research. Perhaps findings might be conditioned by the quality of research studies and designs, the location and timing of evaluations conducted, or even by the groups involved in the research (much of the BWC research has been clustered amongst groups of researchers at specific universities). A major concern with BWC outcome evaluations has been the extent of contamination between the treatment and control groups, as well as how outcomes are measured. Lum et al. (2019) acknowledge the importance of a systematic review to parse out these important aspects of studies.

| OBJECTIVES
The primary objective of this review is to synthesize and explore the evidence on the impacts of BWCs on several outcomes of interest to police, policymakers, and the wider community. Specifically, given the existing research found by Lum et al. (2019), this review will focus on examining two categories of effects of BWCs:
• The impact of BWCs on officer behaviors, as measured by officer use of force, complaints, arrest and citation behavior, and proactive activities. We note that changes in citizen complaints might also be a measure of civilian behavior, as discussed above. However, for this review, it will be used as a measure of officer behavior.
• The impact of BWCs on civilian behaviors, as measured by community members' compliance with police commands (to include resisting arrest or assaults against officers). Studies examining this type of impact may also include evaluations of whether BWCs deter criminal or disorderly conduct of community members. Additionally, some studies have examined citizen willingness to call the police (either as a victim or witness) or cooperate in criminal investigations.
The second objective of this review is to explore explanations for variations in effect sizes and directions of effects that are likely to be found across studies. Explanations could be due to variations in the location, context, quality, or characteristics of research studies. Overall, the goal of the review will be to provide practical information to police agencies, municipalities, governments, and citizens as to

| METHODOLOGY
Given the two objectives discussed above as well as variations in outcomes measured, this systematic review will be organized into two sections (impacts on officer behavior and impacts on civilian behavior). The review may be further broken down into different subareas depending on the outcomes measured (see Gill  analysis-of-covariance, and propensity score matching, among others. Use of a statistical control method is sufficient for inclusion; we will not exclude studies based on a subjective assessment of the quality of the statistical controls. Rather, any quasi-experimental design that controls for possible explanations for BWC outcomes, such as officer characteristics (race, gender, age, time in service, rank, etc.) or civilian or event characteristics (race, gender, age, situation, the reason for the stop, etc.), will be eligible. Quasi-experimental designs that do not have a comparison group or do not use the above methods to achieve comparability are not eligible for inclusion in this review.

| Types of participants
Given the objectives outlined above, the population of interest is law enforcement officers and civilians. However, it should be noted that the units of analysis could vary, to include officers, groups of officers, officer-shift combinations, and non-law enforcement personnel (community members, citizens, etc.).
In the original title for this protocol, we also listed the "police organization" as being impacted by BWCs. However, given the scarcity of experimental or quasi-experimental research on BWC impacts on organizations, we will not be pursuing this area of BWC

| Types of interventions
Studies that examine the use of BWCs by law enforcement officers will be eligible for this review. Excluded are studies that focus solely on the use of BWCs for interrogations in an interrogation room within the police agency. We also note that in the Title submitted for this review, we suggested the possibility of examining case or investigative outcomes. However, we decided to exclude this category of studies from this review. There is a large body of research that examines the impact of videotaping more generally on interrogations and interviewing of suspects, witnesses, and victims, and the use of videos and court outcomes. Given that these outcomes do not specifically focus on the impact of cameras on officer and citizen behavior and given the overlap of this area with other unrelated areas (investigatory effectiveness using video technologies), we excluded this area of research from this review.

| Duration of follow-up
The expected effects of BWCs are immediate. That is, they are presumed to have an effect while they are being used. As such, the outcomes are measured concurrently with the intervention and no follow-up period is needed in assessing their effects. Some studies do measure the longer-term effects of BWCs, but in these studies, the BWCs are still in use. While we do not expect to find studies that measure effects at a follow-up period after the BWCs are no longer in use, if such a study is located during the search and screening process, it will be included in the review.

| Search strategy and screening process
The search for BWC research will be led by the Global Policing
To capture studies for this review, we will use BWC-specific terms to search the GPD corpus of full-text documents that have been screened as reporting a quantitative impact evaluation of a policing intervention. Specifically, we use the following terms to search the title and abstract fields of the corpus of documents published between January 2004¹ and December 2018:
The results of this search will then be processed using a two-coder system. The abstract for each study found from the search above will be examined by two coders, who will separately determine whether a study is "potentially eligible" for further full-text review given this protocol's criteria; "not eligible" for further full-text review; "unclear" (the coder could not make a determination given the information provided); or a "relevant review" (the article is not a study, but should be flagged as a relevant review of studies). The codes from both coders will be reviewed for differences by a principal investigator (Lum or Koper). Each difference will then be discussed and, if needed, mitigated by a third coder. Studies with differences that persist and cannot be mitigated (specifically if one coder continues to believe a study is "potentially eligible") will be retained and the full text of the study will be examined in the next screening process.
After reviewing the initial abstracts from the GPD, the research team will also examine whether any relevant studies from Lum et al.
Once studies are determined by at least one coder to be "potentially eligible", the full-text document of each study will be obtained and examined separately by two coders for eligibility according to the "Criteria for Including and Excluding Studies" as described above. Studies must satisfy these criteria in order to be included in the systematic review. If the coders differ in their assessment, a third coder will be used to examine the study for eligibility. If a study continues to draw debate, other coders and experts may be consulted to determine its eligibility for the systematic review.

| Criteria for determination of independent findings
The primary unit of analysis for this review will be a research study, defined as a distinct sample of study participants involved in a common research project. Multiple reports (e.g., publications, technical reports, etc.) from a common research study will be coded as a single study. Stated differently, a research study will only be treated as unique if the study sample does not include study participants included in any other coded study. Multiple effect sizes will be coded, if possible, from studies when multiple outcomes are analyzed. Statistical independence will be maintained or modeled in all statistical analyses.
The choice of outcomes in this review will be prioritized. For example, while there are many different types of use of force (e.g., hands only, nonlethal instruments, firearm use) and complaints (e.g., complaints of rudeness, service delivery), we will select the most general measure of use of force or complaints measured (i.e., counts of reports of use of force or complaints generated). Additionally, there are many different types of crimes and infractions that may result in arrests and citations, but only the most general measure of arrest and citation will be measured (i.e., "all arrests" or "all citations"). Similarly, for non-police civilian behaviors, the more general behavioral categories will be measured (e.g., "resisting arrest," "assault on officers," etc.). With regard to officer proactivity, a decision may need to be made as to whether to examine the overall levels of proactivity or specific types of proactivity (e.g., stop-question-and-frisks, traffic stops, pedestrian stops, problem-solving, community policing, etc.). As Lum et al. (2019) discuss, not all proactive activities are viewed similarly by either the police or community members, and these activities may need to be parsed out during the analysis to examine BWCs' impacts on different types of proactive police behaviors. Finally, for the impacts of BWCs on investigative case files, general categories will be used, including "arrest" or "conviction."

| Details of study coding categories
Per Campbell policy, all studies will be double-coded. Detailed coding categories and instructions are presented in Appendix C. Coding will include information on the nature of the BWC use, comparator condition, contextual features of the agency, method and design features, dependent measures and effect sizes for the above outcomes, and risk-of-bias indicators. Data will be maintained in a relational database (MySQL) with coding forms developed in LibreOffice Base (similar to MS Access).

¹ While the GPD data extends back to 1950, to date full-text documents have only been screened back to 2003. For this systematic review, the authors believe the use of the GPD is justified, as the earliest recorded evaluation for BWCs according to Lum et al. was Goodall (2007). Additionally, per the GPD search protocol, grey literature will also be searched (see Appendix B).
² See https://bwctta.com/resources/bwc-resources/impacts-bwcs-use-force-directoryoutcomes and https://bwctta.com/resources/bwc-resources/impact-bwcs-citizencomplaints-directory-outcomes.

| Statistical procedures and conventions
Based on prior work by Lum et al. (2019), we expect to find a sufficient number of studies to conduct a meta-analysis for the three broad outcomes described above. However, given the various outcomes and study designs that are likely to be found, a variety of approaches to calculating effect sizes will have to be used, as described by Lipsey and Wilson (2001). The various effect sizes will then be converted to Cohen's d, except for outcomes that are more naturally measured dichotomously, in which case the odds ratio will be used. Calculation techniques as described by Lipsey and Wilson and the online effect size calculator developed by David Wilson will be employed.
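As a rough illustration of the effect size conventions named above, the following Python sketch computes Cohen's d from group means and standard deviations, an odds ratio from a 2x2 table, and the logit-based conversion of a log odds ratio to d. The function names and the input values in the usage note are hypothetical; this is not the calculator or the coding form referred to in the protocol.

```python
import math


def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)


def odds_ratio(events_t, nonevents_t, events_c, nonevents_c):
    """Odds ratio from a 2x2 table of treatment and control frequencies."""
    return (events_t * nonevents_c) / (nonevents_t * events_c)


def log_odds_ratio_to_d(log_or):
    """Logit method: d is approximately ln(OR) * sqrt(3) / pi."""
    return log_or * math.sqrt(3) / math.pi
```

For example, with made-up inputs, `cohens_d(10, 12, 4, 4, 50, 50)` gives a d of -0.5 and `odds_ratio(10, 40, 20, 30)` gives 0.375; an odds ratio of 1 (log odds ratio of 0) converts to a d of 0.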
A meta-analysis will be conducted using random-effects models estimated via full-information maximum likelihood. Primary analyses will be performed using Stata packages developed by David B. Wilson and available at http://mason.gmu.edu/~dwilsonb/ma.html. The robust standard error method of modeling statistical dependencies will be implemented with the Stata package robumeta (see http://www.northwestern.edu/ipr/qcenter/RVE-meta-analysis.html for details). Moderator analyses of a single categorical variable will be fit using the analog-to-the-ANOVA method, also under a random-effects model. Moderator analyses of continuous moderators or multiple moderators will be conducted with meta-analytic regression methods, also under a random-effects model. Results will be presented separately for experimental (randomized) and quasi-experimental designs, although these may be combined in moderator analyses.
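To make the random-effects idea concrete, here is a minimal Python sketch of inverse-variance pooling. Note the substitution: it estimates the between-study variance with the simpler DerSimonian-Laird method-of-moments estimator, not the full-information maximum likelihood estimator named above, and it does not reproduce the Stata packages or the robust variance method. The function name and return keys are assumptions made for illustration.

```python
import math


def random_effects_dl(effects, variances):
    """Pool effect sizes under a random-effects model (DerSimonian-Laird tau^2)."""
    # Fixed-effect (inverse-variance) weights and mean, needed for Q.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Homogeneity statistic Q and its degrees of freedom.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1

    # Method-of-moments estimate of between-study variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to each study's sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    mean = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {"mean": mean, "se": se, "tau2": tau2, "q": q}
```

When all studies report the same effect, Q is zero and the estimate collapses to the fixed-effect mean; as the effects diverge, tau² grows and the study weights become more equal.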
Publication-selection bias will be assessed in three ways.
First, analyses will compare the results from published and unpublished reports. Published documents will include peer-reviewed journal articles, books, and book chapters. All other report forms, such as theses, technical reports, and government and agency reports, will be considered unpublished. Second, we will perform a trim-and-fill analysis on the major outcome categories.
Third, we will visually inspect a funnel plot on the major outcome categories.
Sensitivity analysis will be conducted if needed, based on initial findings. As already discussed, moderator analysis will be conducted for this review, and can include (but is not limited to) the following:
• Whether the study was done inside or outside of the United States

The free-text search terms for the GPD are provided in Table A1 and are grouped by substantive (i.e., some form of policing) and evaluation terminology. Although the search strategy across search locations may vary slightly, the search follows a number of general rules:
• Search terms will be combined into search strings using Boolean operators "AND" and "OR". Specifically, terms within each category will be combined with "OR" and categories will be combined with "AND". For example: (police OR policing OR "law#enforcement") AND (analy* OR ANCOVA OR ANOVA OR …).
• Compound terms (e.g., law enforcement) will be considered single terms in search strings by using quotation marks (i.e., "law*enforcement") to ensure that the database searches for the entire term rather than separate words.
• Wild cards and truncation codes will be used for search terms with multiple iterations from a stem word (e.g., evaluation, evaluate) or spelling variations (e.g., evaluat* or randomi#e).
• If a database has a controlled vocabulary term that is equivalent to "POLICE", we will combine the term in a search string that includes both the policing and evaluation free-text search terms. This approach will ensure that we retrieve documents that do not use policing terms in the title/abstract but have been indexed as being related to policing in the database. An example of this approach is the following search string: (((SU: "POLICE") OR (TI,AB,KW: police OR policing OR "law*enforcement")) AND (TI,AB,KW: intervention* OR evaluat* OR compar* OR …)).
• For search locations with limited search functionality, we will implement a broad search that uses only the policing free-text terms.
• Multidisciplinary database searches will be limited to relevant disciplines (e.g., include social sciences but exclude physical sciences).
• Search results will be refined to exclude specific types of documents that are not suitable for systematic reviews (e.g., newspapers, front/back matter, book reviews).
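The first two rules above (terms joined with "OR" within a category, categories joined with "AND", and compound terms quoted) can be sketched in Python as follows. The helper names are hypothetical, and the term lists in the usage note are only a small sample, not the full lists in Table A1. The protocol's own examples write the compound term with a wildcard inside the quotation marks, whereas this sketch simply quotes the phrase.

```python
def quote_compound(term):
    """Wrap a multi-word term in double quotes so it is searched as a phrase."""
    if " " in term and not term.startswith('"'):
        return f'"{term}"'
    return term


def build_search_string(categories):
    """Join terms within each category with OR, then join categories with AND."""
    groups = []
    for terms in categories:
        joined = " OR ".join(quote_compound(t) for t in terms)
        groups.append(f"({joined})")
    return " AND ".join(groups)
```

Calling `build_search_string([["police", "policing", "law enforcement"], ["evaluat*", "randomi#e"]])` yields `(police OR policing OR "law enforcement") AND (evaluat* OR randomi#e)`. Wildcard and truncation symbols are passed through unchanged because their syntax differs between databases.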
We note that there is a substantial overlap of the content coverage between many of the databases. Therefore, we have used the Optimal Searching of Indexing Databases (OSID) computer program (Neville & Higginson, 2014) Table A2.
Appendix: GPD systematic compilation strategy

Inclusion criteria
Each record captured by the GPD systematic search must satisfy all inclusion criteria to be included in the GPD: timeframe, intervention, and research design. There are no restrictions applied to the types of outcomes, participants, settings or languages considered eligible for inclusion in the GPD.

Types of interventions
Each document must contain an impact evaluation of a policing intervention. We define a policing intervention as some kind of a

Types of study designs
The GPD includes quantitative impact evaluations of policing interventions that utilize randomized experimental (e.g., RCTs) or quasi-experimental evaluation designs with a valid comparison group that does not receive the intervention. The GPD includes designs where the comparison group receives "business-as-usual" policing, no intervention or an alternative intervention (treatment-treatment designs).
The specific list of research designs included in the GPD is as follows:
• Systematic reviews with or without meta-analyses
• Crossover designs
• Cost-benefit analyses
• Regression discontinuity designs
• Designs using multivariate controls (e.g., multiple regression)
• Unmatched control group designs without pre-intervention measures where the control group has face validity

Only 10% of the content in this database has abstracts, and a full-text search returns > 250,000 results because of the inability to construct complex search strings. Therefore, a modified search of the unique titles across these collections will be more pragmatic than a full search of the database.

Systematic screening
To establish eligibility, records captured by the GPD search progress through a series of systematic stages which are summarised in Figure A1, with additional detail provided in the following subsections.

Title and abstract screening
After removing duplicates, the title and abstract of each record captured by the GPD systematic search are screened by trained research staff to identify potentially eligible research that satisfies the following criteria:
• Document is dated between 1950 and the present
• Document is unique (i.e., not a duplicate)
• Document is about police or policing
• Document is an eligible document type (e.g., not a book review)
Records are excluded if the answer to any one of the criteria is unambiguously "No," and will be classified as potentially eligible otherwise. Records classified as potentially eligible progress to full-text document retrieval and screening stages.

Full-text eligibility screening
Wherever possible, a full-text electronic version of eligible records will be imported into SysReview. For records without an electronic version, a hardcopy of the record will be located to enable full-text eligibility screening. The full text of each document will be screened to identify studies that satisfy the following criteria:
• Document is dated between 1950 and the present;
• Document is unique;
• Document reports a quantitative statistical comparison;
• Document reports on policing evaluation;
• Document reports a quantitative impact evaluation of a policing intervention; and
• Evaluation uses an eligible research design.

Appendix: Coding forms and instructions
Version: July 8, 2019
Note: This is a living document that will be updated and modified during the coding process as decisions are made regarding how to handle edge cases or other refinements that are made to the coding protocol.
Coding will be done directly into a MySQL relational database using LibreOffice Base as the front-end with detailed coding forms that reflect the coding protocol.

Initial eligibility screening
Initial eligibility screening will be performed on all titles and abstracts identified by the bibliographic search. Two coders will independently assess whether the title and abstract suggest that the study may meet the full eligibility criteria (see protocol). Each coder will determine if the reference is "potentially eligible", "not eligible", a "relevant review", or that it is "unclear". A reference marked as unclear will be assessed by another coder. Any reference marked as "potentially eligible" by either coder will move forward to a full-text eligibility assessment.
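The two-coder rule for initial screening can be sketched as a small Python function. This is only an illustration: the function name and the returned next-step labels are hypothetical, and the handling of disagreements that involve neither "potentially eligible" nor "unclear" is an assumption based on the discussion step described in the main text.

```python
POTENTIALLY_ELIGIBLE = "potentially eligible"
NOT_ELIGIBLE = "not eligible"
RELEVANT_REVIEW = "relevant review"
UNCLEAR = "unclear"


def initial_screening_outcome(code_a, code_b):
    """Combine two coders' title/abstract codes into a next step."""
    codes = {code_a, code_b}
    # Either coder marking the reference as potentially eligible is enough
    # to move it forward to full-text assessment.
    if POTENTIALLY_ELIGIBLE in codes:
        return "full-text assessment"
    # A reference marked unclear is passed to another coder.
    if UNCLEAR in codes:
        return "assess by another coder"
    # Agreement on "not eligible" or "relevant review" stands as coded.
    if code_a == code_b:
        return code_a
    # Assumption: remaining disagreements are discussed, as in the main text.
    return "discuss disagreement"
```

Because a single "potentially eligible" code is sufficient, the rule errs toward retaining references at this stage.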

Final (full-text) eligibility screening
For full-text screening, each document (reference) will be assessed against the four eligibility criteria (see protocol or database coding form for Final Screening). For each criterion, answer "yes", "no" or "uncertain". If the coder answers "yes" to all four criteria, code the reference as "Eligible". If the coder answers "no" to any item, code the reference as "Not Eligible". If there is a mix of "yes" and "uncertain", then another coder must make the final assessment.
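The decision rule for a single coder's full-text assessment can be written out as a short Python sketch. The function name and the label returned for the referral case are hypothetical; the "Eligible" and "Not Eligible" labels follow the text above.

```python
VALID_ANSWERS = {"yes", "no", "uncertain"}


def full_text_decision(answers):
    """Map a coder's answers on the four eligibility criteria to a code."""
    if len(answers) != 4:
        raise ValueError("expected answers for exactly four criteria")
    if any(a not in VALID_ANSWERS for a in answers):
        raise ValueError("answers must be 'yes', 'no' or 'uncertain'")
    # Any "no" rules the reference out, regardless of the other answers.
    if "no" in answers:
        return "Not Eligible"
    if all(a == "yes" for a in answers):
        return "Eligible"
    # Only "yes" and "uncertain" remain: another coder decides.
    return "Refer to another coder"
```

Checking for "no" first means that a reference with both "no" and "uncertain" answers is coded "Not Eligible" without referral, which matches the order of the rules in the text.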
Two coders will assess each reference for eligibility. Any discrepancies will be resolved through a consensus process.

(in the database, this table is named "3_Outcome") and is the combination of the following fields: StudyID, SubStudyID, OutcomeID, and CoderID. The StudyID and SubStudyID.
1. Study ID & Substudy ID: These two fields uniquely identify the study and sub-study and should correspond to a record coded at the study level (see study level coding instructions).

this table is named "4_EffectSize") is the combination of the following fields: StudyID, SubStudyID, OutcomeID, ESID, and Coder. The StudyID and SubStudyID should uniquely identify which study/substudy is being coded. The OutcomeID indicates which outcome coded at the outcome level is associated with this effect.
The coding form has input fields for nine different ways to compute an effect size. There are times when a unique effect can be computed in more than one way. Select the method that is most accurate. For example, a study may report both means, standard deviations, and sample size information along with an independent t-test associated with these means. Both the former and latter can be used to compute the effect size and should produce the same value. However, the t-test, unless reported to 3 or more digits, is likely to be less precise due to rounding error than the raw means and standard deviations.
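The rounding point can be illustrated with a short Python sketch that computes d both from raw means and standard deviations and from an independent-samples t statistic (pooled variance), using the identity d = t * sqrt(1/n1 + 1/n2). The function names and the numbers are made up for illustration and do not come from any study or from the coding form.

```python
import math


def d_from_means(m1, m2, sd1, sd2, n1, n2):
    """Cohen's d from group means, standard deviations, and sample sizes."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd


def d_from_t(t, n1, n2):
    """Cohen's d from an independent-samples t statistic (pooled variance)."""
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)


# Hypothetical study: two groups of 40 with a common standard deviation.
n1 = n2 = 40
d_raw = d_from_means(3.1, 2.6, 1.2, 1.2, n1, n2)
t_exact = d_raw / math.sqrt(1.0 / n1 + 1.0 / n2)

# Error introduced when the t statistic is reported to 1 or 3 decimal places.
error_1dp = abs(d_from_t(round(t_exact, 1), n1, n2) - d_raw)
error_3dp = abs(d_from_t(round(t_exact, 3), n1, n2) - d_raw)
```

With the unrounded t the two routes agree; with the rounded values, `error_1dp` is larger than `error_3dp`, which is why the raw means and standard deviations are preferred when both are reported.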
In contrast to the above, the same outcome may be analyzed in different ways that would produce different effect sizes. For example, a study may report the raw means and t-test but also report a regression model with a treatment dummy code that adjusts for baseline covariates. In such a situation, code two effect sizes, one based on the means and one on the regression model (i.e., method 7 on the coding form). These should be coded as separate records in the database.
1. Study ID and substudy ID: These two fields uniquely identify the study and sub-study and should correspond to a record coded at the study level (see study level coding instructions).

2. Outcome ID: This field uniquely identifies the outcome associated with this effect size. This should correspond to a record coded at the outcome level (see outcome level coding instructions).
3. Effect Size ID: Assign each effect size for a given StudyID + SubStudyID combination a unique number, starting at 1. For example, if a Study/SubStudy has four effect sizes, these should be numbered 1, 2, 3, 4 in this field.

Date Modified: This is auto-generated and is a timestamp for the last time any changes were made to this record.
Also, if you computed the effect size by hand, include relevant information so that another coder could replicate your computations in this field (e.g., insert R code if applicable).
8. Description of the timing for the effect size. Use this text box to describe the timing for the effect size. For example, the data for the effect size may reflect a 6-month period following the start of the use of BWCs.

9. Timing for the effect size. Indicate if this effect size is measured at baseline (or pretest). For all effect sizes after the start of the use of BWC, select the post-test 1, 2, 3, or 4 sequentially (e.g., the first post-test is 1, next is 2, etc.).
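The identifying fields for an effect size record (StudyID, SubStudyID, OutcomeID, ESID, and Coder) and the numbering rule for the Effect Size ID can be sketched in Python as follows. This is an illustration only: the class and function names are hypothetical and the sketch does not reflect the actual MySQL schema or LibreOffice Base forms.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EffectSizeKey:
    """Composite key for one effect size record, following the fields named above."""
    study_id: int
    substudy_id: int
    outcome_id: int
    es_id: int
    coder: str


def assign_es_ids(records, coder):
    """Number effect sizes 1, 2, 3, ... within each StudyID + SubStudyID pair.

    `records` is a sequence of (study_id, substudy_id, outcome_id) tuples
    listed in the order the effect sizes are coded.
    """
    counters = defaultdict(int)
    keys = []
    for study_id, substudy_id, outcome_id in records:
        counters[(study_id, substudy_id)] += 1
        es_id = counters[(study_id, substudy_id)]
        keys.append(EffectSizeKey(study_id, substudy_id, outcome_id, es_id, coder))
    return keys
```

Because the counter is keyed on the study and sub-study pair, two effect sizes for the same outcome (for example, one from raw means and one from a regression model) receive different Effect Size IDs while sharing the same Outcome ID.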
8. Logistic regression. Results from a logistic regression model.