Measures of fidelity of delivery of, and engagement with, complex, face‐to‐face health behaviour change interventions: A systematic review of measure quality

Purpose. Understanding the effectiveness of complex, face-to-face health behaviour change interventions requires high-quality measures to assess fidelity of delivery and engagement. This systematic review aimed to (1) identify the types of measures used to monitor fidelity of delivery of, and engagement with, complex, face-to-face health behaviour change interventions and (2) describe the reporting of psychometric and implementation qualities.
Methods. Electronic databases were searched, systematic reviews and reference lists were hand-searched, and 21 experts were contacted to identify articles. Studies that quantitatively measured fidelity of delivery of, and/or engagement with, a complex, face-to-face health behaviour change intervention for adults were included. Data on interventions, measures, and psychometric and implementation qualities were extracted and synthesized using narrative analysis.
Results. Sixty-six studies were included: 24 measured both fidelity of delivery and engagement, 20 measured fidelity of delivery, and 22 measured engagement. Measures of fidelity of delivery included observation (n = 17; 38.6%), self-report (n = 15; 34%), quantitatively rated qualitative interviews (n = 1; 2.3%), and multiple measures (n = 11; 25%). Measures of engagement included self-report (n = 18; 39.1%), intervention records (n = 11; 24%), and multiple measures (n = 17; 37%). Fifty-one studies (77%) reported at least one psychometric or implementation quality; 49 studies (74.2%) reported at least one psychometric quality, and 17 studies (25.8%) reported at least one implementation quality.
Conclusion. Fewer than half of the reviewed studies measured both fidelity of delivery of, and engagement with, complex, face-to-face health behaviour change interventions. More studies reported psychometric qualities than implementation qualities. Interpretation of intervention outcomes from fidelity of delivery and engagement measurements may be limited due to a lack of reporting of psychometric and implementation qualities.


Statement of contribution
What is already known on this subject?
Evidence of fidelity and engagement is needed to understand effectiveness of complex interventions.
Evidence of fidelity and engagement is rarely reported.
High-quality measures are needed to measure fidelity and engagement.
What does this study add?
Evidence that indicators of quality of measures are reported in some studies.
Evidence that psychometric qualities are reported more frequently than implementation qualities.
A recommendation for intervention evaluations to report indicators of quality of fidelity and engagement measures.

Most interventions aimed at changing health behaviours are complex in that they contain multiple components (Campbell et al., 2000; Oakley et al., 2006). The effectiveness of face-to-face interventions depends on providers delivering the intervention as intended and participants engaging with the intervention. However, delivering interventions with fidelity and ensuring that participants engage with interventions are not easy to achieve (Glasziou et al., 2010; Hardeman et al., 2008; Lorencatto, West, Bruguera, & Michie, 2014; Michie et al., 2008). Furthermore, it is more difficult to ensure that complex interventions are delivered as intended and engaged with than it is for simple interventions (Dusenbury & Hansen, 2004; Greenhalgh et al., 2004).
To understand, and potentially improve, intervention effectiveness, it is necessary to measure the extent to which the intervention is delivered in line with the protocol ('intervention fidelity') and engaged with by participants. Although many conceptualizations of engagement have been proposed (Angell, Matthews, Barrenger, Watson, & Draine, 2014), in this review, the term 'participant engagement' is used as an umbrella term to encapsulate constructs of fidelity that relate to participants' engagement with intervention content. This includes whether participants understand the intervention, whether they can perform the skills required by the intervention ('intervention receipt'), and whether they use these skills in daily life ('intervention enactment') (Borrelli, 2011). In doing this, the review makes a clear distinction between providers' behaviours (fidelity of delivery) and participants' behaviours (engagement). Both fidelity of delivery and engagement are necessary to understand the effects of the intervention; if effects are not found, this may be due to low fidelity of delivery and/or engagement and is therefore not a test of the potential of the intervention components ('active ingredients') to bring about change (Borrelli, 2011;Durlak, 1998;Lichstein, Riedel, & Grieve, 1994).
Fidelity of delivery has been assessed by self-report measures (Bellg et al., 2004), and by audio-recording, which is considered to be the gold standard (Bellg et al., 2004; Borrelli, 2011; Lorencatto et al., 2014). Methods used to assess engagement include self-report measures (Bellg et al., 2004; Burgio et al., 2001; Carroll et al., 2007), observation of skills (Burgio et al., 2001), and homework reviews (Bellg et al., 2004). Systematic reviews of measures used to monitor fidelity of delivery demonstrate that these measures have consistently been used in intervention research, in both educational (Maynard, Peters, Vaughn, & Sarteschi, 2013) and health settings (Rixon et al., 2016). For example, a review of 55 studies found that intervention receipt was mostly measured by assessing understanding and performance of skills (Rixon et al., 2016). Observational measures may provide a more valid representation of what is delivered than self-report measures (Breitenstein et al., 2010) and avoid social desirability bias (Schinckus, Van den Broucke, Housiaux, & Consortium, 2014). However, observation is likely to require more time and resources (Breitenstein et al., 2010; Schinckus et al., 2014), and it may also change the behaviour of those being observed (McMahon, 1987; as cited in Moncher & Prinz, 1991).
To understand which components have been delivered and engaged with, suitable measures are needed. Researchers suggest that measures should be psychometrically robust, with good reliability and validity (Gearing et al., 2011; Glasgow et al., 2005; Lohr, 2002; Stufflebeam, 2000). Reliability is defined as achieving consistent results in different situations (Roberts, Priest, & Traynor, 2006), and validity is defined as the extent to which a measure captures the construct it aims to measure. Previous reviews found that few studies reported information on the reliability or validity of fidelity or engagement methods. A systematic review of fidelity of delivery in after-school programmes found that no studies reported reliability (Maynard et al., 2013), and a systematic review of intervention receipt in health research found that 26% of studies reported on reliability and validity (Rixon et al., 2016). This makes it difficult for researchers to fully interpret the quality of measures and therefore the results of intervention outcomes. In this review, we use the term 'psychometric qualities' to refer to the quality of the measures. Aspects of 'psychometric qualities' of measures in the fidelity literature include the following: using multiple, independent researchers to rate fidelity of delivery; calculating inter-rater agreement of measurements; and randomly selecting data (Bellg et al., 2004; Borrelli, 2011; Breitenstein et al., 2010; Lorencatto, West, Seymour, & Michie, 2013).
It is also necessary to ensure that measures are easy to use in practice and to minimize missing responses, which are common in health care self-report research (Shrive, Stuart, Quan, & Ghali, 2006). Researchers suggest that practicality and acceptability influence the extent to which measures are used in practice (Glasgow et al., 2005;Holmbeck & Devine, 2009;Lohr, 2002). Practicality is defined as whether the measure can be used despite limited resources (Bowen et al., 2009), for example, being short and easy to use, and reducing participant and provider burden (Glasgow et al., 2005;Lohr, 2002). Acceptability is defined as whether the measure is appropriate for those who will use it (Bowen et al., 2009), for example, by including alternative forms and language adaptations, and by ensuring that measures are easy to interpret (Lohr, 2002). In this review, we use the term 'implementation qualities' to refer to descriptions of how the measures were implemented in practice. Aspects of 'implementation qualities' of measures in the fidelity literature include time constraints, cost, and reactions to measurements (Breitenstein et al., 2010).
Previous reviews have identified the measures used to monitor fidelity of delivery of after-school programmes (Maynard et al., 2013), evidence-informed interventions (Slaughter, Hill, & Snelgrove-Clarke, 2015), and the measures used to monitor intervention receipt in health care settings (Rixon et al., 2016). Furthermore, researchers have previously outlined some strengths and weaknesses of different measures of fidelity of delivery and engagement (e.g., Borrelli, 2011; Breitenstein et al., 2010; Moncher & Prinz, 1991). To the authors' knowledge, no systematic reviews have been conducted to identify the measures used to monitor fidelity of delivery and engagement (including intervention receipt and enactment) in complex, face-to-face health behaviour change interventions. This review will also extend previous research by describing the reporting of both psychometric and implementation qualities of these measures. Synthesizing the psychometric and implementation qualities of fidelity of delivery and engagement measures is needed to determine the quality of measures and how easy they are to implement. 'Health' includes physical, mental, and social well-being, as recommended by the World Health Organisation (WHO, 2017). This review aimed to:
1. Identify the types of measures used to monitor (1) the fidelity of delivery of, and (2) engagement with, complex, face-to-face health behaviour change interventions.
2. Describe these measures as reported in terms of both psychometric and implementation qualities.

Methods
The search and screening strategies were developed using the methods advocated by the Cochrane Collaboration (Lefebvre, Manheimer, & Glanville, 2011). Eligibility criteria for considering studies were specified using the 'Participants', 'Intervention', and 'Outcomes' criteria from PICO (O'Connor, Green, & Higgins, 2011).
Inclusion criteria
1. Participants: Adults aged 18 and over.
2. Intervention: Complex, face-to-face behaviour change interventions aimed at improving health behaviours. Health is defined as physical, mental, or social wellbeing (WHO, 1946; as cited in WHO, 2017). Other modes of intervention delivery, such as digital interventions, may have different issues in relation to fidelity of delivery and engagement; therefore, these were not included in this review.
3. Outcomes: Studies which described measures to monitor fidelity of delivery and/or engagement and reported outcomes for fidelity of delivery and/or engagement and intervention effectiveness using quantitative measures. Only quantitative studies were included to increase the ability to compare across studies.

Exclusion criteria
1. Review articles, articles not written in English, or articles not peer-reviewed.
2. Articles in which the intervention outcome could not be clearly distinguished from the engagement or fidelity of delivery outcome.

Search strategy
Five electronic databases (PubMed, ScienceDirect, PsycINFO, Embase, and CINAHL Plus) were searched from the inception of each database up to November 2015. Implementation Science was searched, and reference lists of relevant known reviews (Carroll et al., 2007;Durlak & DuPre, 2008;Toomey, Currie-Murphy, Matthews, & Hurley, 2015) were screened to identify additional studies. After the initial search, reference lists of reviews identified from the search (Clement, Ibrahim, Crichton, Wolf, & Rowlands, 2009;Conn, Hafdahl, Brown, & Brown, 2008;Gucciardi, Chan, Manuel, & Sidani, 2013;Reynolds et al., 2014;Smith, Soubhi, Fortin, Hudon, & O'Dowd, 2012), relevant protocols (Gardner et al., 2014), and forward and backward searching of included studies were screened to identify further articles. The articles generated by this search strategy were sent to 21 experts to ask whether they knew of relevant articles that were missing from the search results.
Initial search terms were piloted and refined iteratively with sequential testing to identify false-positive and false-negative results and ensure that the search captured all relevant keywords. A subject librarian was consulted in the development of the search terms.
Free and mapped searches (using Medical Subject Heading Terms) were conducted. Boolean operators were used to construct a search incorporating all search terms when combination searches were not possible. Search outputs were filtered for English full texts, peer-reviewed articles, adult participants and health topics. The final search strategy is in Appendix S1.
To access articles not available through the university library database, the authors were contacted or articles were accessed through library services.
This search strategy was not exhaustive, but was instead used to identify as many papers as possible that measured and reported fidelity of delivery and/or engagement in sufficient depth to provide insight into the measures used.

Data collection and analysis

Study selection
One reviewer conducted the electronic searches and screened the reference lists of relevant articles. All identified titles and abstracts were downloaded and merged using EndNote. Duplicates were removed. Two reviewers independently screened all (1) titles, (2) abstracts, and (3) full texts against inclusion and exclusion criteria. Reviewers met after each stage to determine agreement and resolve discrepancies. Any articles which reviewers were unsure of were retained until data extraction, when more information was available. Inter-rater reliability was assessed using percentage agreement and kappa statistics. Scores from both the initial search screening and additional search screening were combined to calculate agreement scores. For the title screening, researchers achieved 64.9% agreement (n = 802, two missing responses, kappa = .49, PABAK = .47). For the abstract screening, researchers achieved 68% agreement (n = 425, three missing responses, kappa = .36, PABAK = .36). For the full-text screening, researchers achieved 71.8% agreement (n = 266, kappa = .46, PABAK = .58). The full-text kappa scores (Cohen, 1960) indicated fair agreement (Orwin, 1994; as cited in . This might reflect the difficulty of identifying relevant articles due to differences in terminology across studies. Information on fidelity of delivery and engagement was often reported in separate articles from those reporting intervention outcomes.
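The agreement statistics named above can be illustrated with a minimal sketch. It computes percentage agreement, Cohen's kappa, and the prevalence-adjusted bias-adjusted kappa (PABAK) for two raters. The screening decisions below are invented for illustration, and the use of three decision categories (include, exclude, unsure) is an assumption, as the number of categories is not stated in the text. The multi-category PABAK formula used here, (k * Po - 1) / (k - 1), reduces to 2 * Po - 1 when k = 2.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters made the same decision."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement: product of each rater's marginal proportions, summed
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


def pabak(observed_agreement, n_categories):
    """Prevalence-adjusted bias-adjusted kappa for k categories."""
    k = n_categories
    return (k * observed_agreement - 1) / (k - 1)


# Hypothetical screening decisions (I = include, E = exclude, U = unsure)
reviewer_1 = ["I", "I", "I", "E", "E", "E", "E", "U", "U", "E"]
reviewer_2 = ["I", "I", "E", "E", "E", "E", "I", "U", "E", "E"]

po = percent_agreement(reviewer_1, reviewer_2)
print(round(po, 2))                                    # 0.7
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))  # 0.49
print(round(pabak(po, n_categories=3), 2))             # 0.55
```

Kappa falls below the raw agreement because part of the observed agreement would be expected by chance given how often each reviewer used each category; PABAK instead assumes all categories are equally likely.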

Data extraction
A data extraction form was developed using a combination of standardized forms: Guidelines International Network-Evidence Tables Working Group intervention template (Guidelines International Network, 2002-2017) and the Oxford Implementation Index (Montgomery, Underhill, Gardner, Operario, & Mayo-Wilson, 2013). Data on the measures used to monitor fidelity of delivery and engagement and results were extracted, along with any qualities of measures that were reported. Psychometric qualities and implementation qualities were not pre-specified before data extraction; therefore, any information that was reported in the results and discussion section of the original articles in relation to the quality of the measures was extracted. As a minimum quality check (Centre for Reviews and Dissemination, University of York, 2009), an independent researcher checked 20% of data extraction forms. Minor errors of punctuation were identified; however, no further details were extracted, and therefore, one researcher extracted data from all studies.

Data synthesis
One researcher used narrative analysis to summarize the fidelity of delivery and engagement measures and the reporting of psychometric and implementation qualities. If authors specified the type of engagement that they measured, for example, 'intervention receipt' or 'intervention enactment', these were reported separately within engagement. One researcher synthesized the information on methods. The extracts from the text that included descriptions of qualities were summarized, and the part of the procedure that the quality related to was recorded. Psychometric qualities included reliability (achieving consistent results in different situations; Roberts et al., 2006) and validity (measuring what the measure aims to measure; Roberts et al., 2006). Implementation qualities included acceptability (appropriate for those who will use it; Bowen et al., 2009), practicality (can be used despite limited resources; Bowen et al., 2009), and cost. Researchers were open to other categories that may have emerged if qualities did not fit into these categories. Due to the heterogeneity of studies, a descriptive rather than quantitative synthesis of data was conducted (Deeks, Higgins, & Altman, 2008; Popay et al., 2006).
Two researchers were involved in the categorization of psychometric and implementation qualities. The first author coded 10% of the qualities and asked an independent researcher to check responses. Disagreements were identified, and both researchers independently coded an additional 10% of qualities. Researchers met after each round to discuss disagreements. This process was repeated until 80% agreement on the categorization of qualities was reached, as recommended by Lombard, Snyder-Duch, and Bracken (2002). After four rounds (40% of qualities were independently coded), reliability was achieved with 80.1% agreement between coders. The first author coded the rest of the qualities, based on discussions with the second researcher. Following this, the second researcher checked a further 10% of the first author's independent coding and any qualities that the first author was unsure how to code.
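The stopping rule described above can be sketched as follows. This is an illustration only: the codes and round sizes are invented, and it assumes agreement is calculated separately for each round rather than cumulatively, which the text does not specify.

```python
def rounds_until_agreement(rounds, threshold=0.80):
    """Return (round_number, agreement) for the first round of double coding
    whose percentage agreement meets the threshold, or (None, None) if no
    round does. Each round is a list of (coder_1_code, coder_2_code) pairs.
    Assumption: agreement is computed per round, not pooled across rounds."""
    for round_number, pairs in enumerate(rounds, start=1):
        agreement = sum(a == b for a, b in pairs) / len(pairs)
        if agreement >= threshold:
            return round_number, agreement
    return None, None


def make_round(n_agree, n_disagree):
    """Build a hypothetical round with the given numbers of matching and
    non-matching codes ('P' = psychometric, 'I' = implementation)."""
    return [("P", "P")] * n_agree + [("P", "I")] * n_disagree


# Hypothetical rounds with 50%, 60%, 70%, and 90% agreement
example_rounds = [
    make_round(5, 5),
    make_round(6, 4),
    make_round(7, 3),
    make_round(9, 1),
]

print(rounds_until_agreement(example_rounds))  # (4, 0.9)
```

Once the threshold is met, double coding stops and a single coder continues, as in the procedure described above.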

Results
After duplicates were removed, 809 records were identified. Sixty-six articles were included in the analysis (Figure 1).

Measures
Measures of fidelity of delivery were categorized into observational measures (n = 17; 38.6%), self-report measures (n = 15; 34%), quantitatively rated qualitative interviews (n = 1; 2.3%), and multiple measures (n = 11; 25%). Of the studies that used multiple measures, six (14%) used at least one type of observational measure and nine (20.5%) used at least one type of self-report measure. In total, 23 (52%) studies used at least one type of observational measure and 24 (55%) used at least one type of self-report measure (see Table 1 for details).
Measures of engagement were categorized into self-report measures (n = 18; 39.1%); intervention records (n = 11; 24%), for example, attendance monitoring; and multiple measures (n = 17; 37%). Of the studies that used multiple measures, 15 (32.6%) used at least one type of self-report measure. In total, 33 (76.7%) studies used at least one type of self-report measure (see Table 1 for details). Two studies reported measuring receipt and enactment (studies 6 and 39), and one study reported measuring receipt only (study 14).
The majority of studies (fidelity of delivery, n = 31; 70.45%; engagement, n = 42; 91.3%) did not report whether they developed their own measure or used a previously developed measure. For fidelity of delivery, eight (18.18%) used a previously developed measure and five (11.36%) developed their own measures. For engagement, three (6.5%) studies used previously developed measures and one (2.2%) both developed its own measures and used previously developed measures.
Many studies did not specify the type of scales used to quantify fidelity of delivery (n = 23; 52.3%) or engagement (n = 29; 63%). For fidelity of delivery, 12 studies (27.3%) reported using rating scales (which ranged from 3-point scales to 10-point scales), eight (18.2%) reported using dichotomous scales and one (2.3%) used rating scales and dichotomous scales. For engagement, 12 studies (26.1%) reported using rating scales (which ranged from 3-point scales to 10-point scales), three (6.5%) reported using dichotomous scales, and two (4.4%) reported using a combination of rating scales and dichotomous scales.
For both fidelity of delivery (n = 23; 52.3%) and engagement (n = 45; 97.8%), many studies did not specify how many participants they sampled. Five (11.4%) measured fidelity of delivery for all participants and 16 (36.4%) measured fidelity of delivery in a subsample of participants. Of those studies that measured fidelity of delivery in a subsample, four reported the number of sessions that they sampled, four reported the number of clinicians/sites from which data were sampled, six reported the percentage of sessions that they sampled, and two reported sampling some but not all participants without specifying how many. One (2.2%) study reported measuring engagement in a subsample of participants.
The majority of studies did not specify whether they measured fidelity of delivery (n = 38; 86.4%) or engagement (n = 35; 76.1%) in all conditions; therefore, it is likely they measured the intervention group only. Four (9.1%) reported measuring fidelity of delivery in all intervention groups, and two (4.5%) reported measuring fidelity of delivery in the intervention group only. Nine (19.6%) reported measuring engagement in all intervention groups, and two (4.3%) reported measuring engagement in the intervention group only.

Reporting of psychometric and implementation qualities

Studies
Of all included studies, 51 (77%) reported at least one psychometric or implementation quality of their measures (fidelity of delivery: n = 38, 86.4%; engagement: n = 23, 50%). Forty-nine studies (74.2%) reported at least one psychometric quality, and 17 studies (25.8%) reported at least one implementation quality (see Table 2 for details).
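The percentages above use different denominators, which the following sketch makes explicit. The denominators of 44 studies for fidelity of delivery and 46 for engagement are inferred here from the reported study counts (24 measuring both, 20 measuring fidelity of delivery only, and 22 measuring engagement only); they are not stated directly in the text.

```python
# Study counts as reported in the review
both, fidelity_only, engagement_only = 24, 20, 22

# Denominators (the fidelity and engagement totals are inferred, see above)
total_studies = both + fidelity_only + engagement_only  # 66
fidelity_studies = both + fidelity_only                 # 44
engagement_studies = both + engagement_only             # 46


def pct(count, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * count / denominator, 1)


print(pct(51, total_studies))       # 77.3: any psychometric or implementation quality
print(pct(49, total_studies))       # 74.2: at least one psychometric quality
print(pct(17, total_studies))       # 25.8: at least one implementation quality
print(pct(38, fidelity_studies))    # 86.4: fidelity of delivery measures
print(pct(23, engagement_studies))  # 50.0: engagement measures
```

Because 24 studies measured both constructs, the fidelity of delivery and engagement counts (38 and 23) overlap and should not be expected to sum to 51.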

Psychometric and implementation qualities
In total, 261 (100%) reported qualities were identified (see Table 3 for details). Of these, 215 (82.4%) psychometric qualities were reported, 41 (15.7%) implementation qualities, and five (1.9%) both psychometric and implementation qualities; 213 qualities were reported in relation to fidelity of delivery measures and 58 qualities for engagement measures.
The most frequently reported psychometric qualities concerned the use of multiple researchers (n = 21: 3 data collection, 2 data analysis, 1 data entry, 3 develop measures, 11 coding, 1 validate coding frame), the validity of measures (n = 17: 9 valid, 8 not valid), the use of independent researchers (n = 16: 14 used independent researchers, 2 did not use independent researchers), reliability of measures (n = 11: 5 reliable, 6 not reliable), the random selection of data (n = 11: 9 randomly selected data, 2 did not randomly select data), and inter-rater agreement (n = 9: 3 high inter-rater agreement, 2 did not report inter-rater agreement, 2 poor to fair, 1 fair to excellent, 1 no coder drift). Please see Table 4 for a detailed list of all psychometric qualities.
The most frequently reported implementation qualities concerned resource challenges (n = 10: 1 sharing Dictaphones, 4 time restrictions, 2 financial restrictions, and 3 technical difficulties) and providers' attitudes (n = 7: 1 dislike paperwork, 1 fear of discouraging participants, 1 nerves, 1 report participants behaving differently, 1 positive attitudes, 1 additional work) (see Table 4 for a list of all qualities).

Key findings
Fewer than half of the reviewed studies measured both fidelity of delivery of and engagement with complex, face-to-face health behaviour change interventions. Measures covered observation, self-report, and intervention records. Whilst 74% reported at least one psychometric quality, only 26% reported at least one implementation quality.

Note. The fidelity of delivery and engagement columns do not add up to 261 because 10 qualities were reported for both fidelity of delivery and engagement.

How findings relate to previous research
The measures used to monitor fidelity of delivery of, and engagement with, complex, face-to-face health behaviour change interventions were consistent with previous recommendations of using observational or self-report measures to monitor fidelity of delivery, and self-report measures to monitor engagement (Bellg et al., 2004; Borrelli, 2011; Burgio et al., 2001; Carroll et al., 2007; Schinckus et al., 2014). A similar percentage of studies used observational and self-report measures to measure fidelity of delivery, despite observational measures being recommended as the gold-standard measure and the reported limitations of self-report measures (Bellg et al., 2004; Borrelli, 2011; Breitenstein et al., 2010; Lorencatto et al., 2014; Schinckus et al., 2014). Intervention records (e.g., attendance or homework) were also used to measure engagement. Intervention records can be considered an objective measure of receipt (Gearing et al., 2011; Rixon et al., 2016) and participation (Saunders, Evans, & Joshi, 2005). However, these measures are limited by their inability to monitor how much participants understand and use the intervention. Other recommended and potentially more objective measures, for example, asking participants to demonstrate skills (Burgio et al., 2001), were not adopted by any study in this review. Perhaps these findings demonstrate that measures need to be easy to use and acceptable to respondents and researchers in order to be selected for use. This explanation is consistent with previous studies which suggest that observational measures are perceived to be more expensive, time-consuming, and difficult to use (Breitenstein et al., 2010; Schinckus et al., 2014). Many studies used measures of fidelity of delivery and engagement specific to one intervention, and therefore, generalizability is limited (Breitenstein et al., 2010). This review found that three quarters of studies reported at least one quality of their measures. This finding demonstrates that the reporting of psychometric qualities in the complex, face-to-face health behaviour change interventions included in this review may not be as infrequent as previously suggested in different populations (Baer et al., 2007; Breitenstein et al., 2010; Maynard et al., 2013; Rixon et al., 2016). However, not all studies reported psychometric qualities, and fewer reported implementation qualities, despite the importance of psychometric and implementation qualities (Gearing et al., 2011; Glasgow et al., 2005; Holmbeck & Devine, 2009; Lohr, 2002; Stufflebeam, 2000). The reporting of psychometric and implementation qualities provides information which allows the reader to determine whether the findings are trustworthy and representative. Where this information is not reported, it is difficult to draw conclusions with high certainty about how well interventions have been delivered or engaged with. This, in turn, makes it difficult to draw conclusions about intervention effectiveness.
The psychometric qualities that were most frequently reported were those recommended by previous research; examples of these are the use of multiple, independent researchers to reliably rate a random percentage of sessions for fidelity of delivery (Bellg et al., 2004;Borrelli, 2011;Lorencatto et al., 2014). However, some qualities which are recommended by research were not frequently reported; an example of this is routine audio-recording (Gresham, Gansle, & Noell, 1993;Miller & Rollnick, 2014). The implementation qualities that were most frequently reported were those concerning resources (including time constraints, financial constraints, and technical difficulties) and providers' attitudes towards measures. These findings could explain why missing responses were reported in some of the studies included in this review (Arends et al., 2014;Chesworth et al., 2015;Dubbert, Cooper, Kirchner, Meydrech, & Bilbrew, 2002;Thyrian et al., 2010) and health care research (Shrive et al., 2006). Providers may not return audio-recordings (Weissman, Rounsaville, & Chevron, 1982) or checklists, if they feel uncomfortable with audio-recording or if they are overwhelmed with paperwork.

Limitations
The aim of this review was to identify a range of studies that met the criteria and reported fidelity of delivery and/or engagement in enough depth to be able to draw conclusions about the reporting of fidelity of delivery and/or engagement measures. To identify as many studies as possible, a comprehensive search was conducted, which included contacting experts and authors to identify further relevant articles that may have been missed by the search strategy. However, we will not have identified articles that did not report monitoring fidelity of delivery or engagement in titles, abstracts, or keywords. A further reason why relevant articles may have been missed is that many terms are used interchangeably in fidelity research and we may not have captured all of these terms in the search strategy. We only included articles that reported a clear fidelity of delivery or engagement measure or outcome. As is the case with many systematic reviews, the search is inevitably limited to its date cut-off. However, future use of natural language processing, ontologies, and machine learning (Larsen et al., 2016) will enable more ongoing updating when aggregating review evidence (see www.humanbehaviourchange.org).
The findings from this review consider the reporting of qualities and not the actual quality of measures. The review findings do not consider strengths or weaknesses of these qualities nor how much weighting should be given to each quality when designing fidelity of delivery and engagement measures. This is an area that could be investigated, building on the current review.

Implications
There are three main implications of these review findings for researchers and intervention developers:
1. The need to fully report details of fidelity of delivery and engagement measures. The findings from this review demonstrated that many studies did not specify details about the sampling or analysis method used in developing measures of fidelity of delivery and/or engagement. If this information is not available, evaluation and replication are difficult to achieve.
2. The need to report both psychometric and implementation qualities for fidelity of delivery and engagement measures. The reporting of psychometric and implementation qualities would be helpful to researchers who are aiming to measure fidelity of delivery or engagement. This information would allow evaluations of what measures and procedures may be feasible.
3. The need to develop high-quality measures of fidelity of delivery and engagement that are acceptable and practical to use but also reliable and valid. Both psychometric and implementation qualities of measures are relevant when selecting, developing, and reporting measures.
If implemented, these steps could help to strengthen the quality of fidelity of delivery and engagement data and the interpretation of intervention effectiveness.

Future research
Further research is needed to evaluate the importance and weighting of each quality when designing fidelity of delivery and engagement measures. One way to do this could be to conduct a Delphi study with experts in intervention fidelity and engagement. This systematic method could be used for building a consensus (Hsu & Sandford, 2007) regarding which psychometric and implementation qualities are most important, and which qualities should be given the most weighting when developing and evaluating fidelity of delivery and engagement measures. This information could then be used to inform the development of measures of fidelity of delivery and engagement that are reliable, valid, acceptable, and practical. Future systematic reviews could explore the qualities of fidelity and engagement measures reported in qualitative studies.

Conclusion
Fewer than half of the reviewed studies measured both fidelity of delivery of and engagement with complex, face-to-face health behaviour change interventions. Measures covered observation, self-report, and intervention records. Whilst 74% reported at least one psychometric quality, only 26% reported at least one implementation quality. Findings suggest that implementation qualities are reported less frequently than psychometric qualities. The findings from this review highlight the need for researchers to report measures of fidelity of delivery and engagement in detail, to report psychometric and implementation qualities, and to develop, use, and report high-quality measures. This would strengthen the quality of fidelity of delivery and engagement data and the interpretation of intervention effectiveness.