The development, reliability, and validity of the Facilitator Assessment Tool: An implementation fidelity measure used in Parenting for Lifelong Health for Young Children

Abstract Background The Parenting for Lifelong Health for Young Children (PLH‐YC) programme aims to reduce violence against children and child behaviour problems among families in low‐ and middle‐income countries (LMICs). Although the programme has been tested in four randomised controlled trials and delivered in over 25 countries, there are gaps in understanding regarding the programme's implementation fidelity and, more generally, concerning the implementation fidelity of parenting programmes in LMICs. Aims This study aims to address these gaps by examining the psychometric properties of the PLH‐YC‐Facilitator Assessment Tool (FAT)—an observational tool used to measure the competent adherence of PLH‐YC facilitators. Examining the psychometric properties of the FAT is important in order to determine whether there is an association between facilitator competent adherence and programme outcomes and, if correlated, to improve facilitator performance. It is also important to develop the implementation literature among parenting interventions in LMICs. Methods The study examined the content validity, intra‐rater reliability, and inter‐rater reliability of the FAT. Revision of the tool was based on consultation with programme trainers, experts, and assessors. A training curriculum and assessment manual was created. Assessors were trained in Southeastern Europe and their assessments of facilitator delivery were analysed as part of a large‐scale factorial experiment (N = 79 facilitators). Results The content validity process with PLH‐YC trainers, experts, and assessors resulted in substantial improvements to the tool. Analyses of percentage agreements and intraclass correlations found that, even with practical challenges, assessments were completed with adequate yet not strong intra‐ and inter‐rater reliability. Conclusions This study contributes to the literature on the implementation of parenting programmes in LMICs. The study found that the FAT appears to capture its intended constructs and can be used with an acceptable degree of consistency. Further research on the tool's reliability and validity—specifically, its internal consistency, construct validity, and predictive validity—is recommended.

Conclusions: This study contributes to the literature on the implementation of parenting programmes in LMICs.The study found that the FAT appears to capture its intended constructs and can be used with an acceptable degree of consistency.Further research on the tool's reliability and validity-specifically, its internal consistency, construct validity, and predictive validity-is recommended.

| Parenting for Lifelong Health
There is considerable evidence that parenting programmes increase positive parenting and parent-child relationships as well as reduce child maltreatment and behaviour problems (e.g., Chen & Chan, 2016;Furlong et al., 2013).One such programme for families with children across the developmental spectrum is the Parenting for Lifelong Health for Young Children programme (PLH-YC), which was originally developed in South Africa by individuals from the Universities of Bangor, Cape Town, and Oxford in collaboration with the World Health Organization, UNICEF, and Clowns Without Borders South Africa along with input from parents and practitioners (Lachman, Sherr, et al., 2016).PLH-YC is a group-based parenting programme for parents of children ages 2 to 9 years rooted in social learning theory and behaviour change principles (e.g., goal setting and discussing progress on goals) (Lachman, Sherr, et al., 2016;Michie et al., 2015;Ward et al., 2014).The programme ranges from five to 12 sessions in length and uses participatory, nondidactic, and strengths-based approaches to empower parents to develop skills in fostering positive relationships, handling conflict and emotions, and employing effective disciplinary approaches (Lachman, Sherr, et al., 2016).Parenting for Lifelong Health (PLH) programmes are now being implemented in over 25 countries in Africa, Asia, and Southeastern Europe.To date, four randomised controlled trials (RCTs) have examined the effectiveness of PLH-YC-two in South Africa, one in the Philippines, and one in Thailand (Gardner et al., forthcoming;Lachman et al., 2017Lachman et al., , 2021;;Ward et al., 2020).These studies found improved positive parenting and child behaviour, while reducing family violence.Further information about PLH and its growing evidence base is available in numerous published papers, protocols, and resources (e.g., Martin, Lachman, et al., 2021;Shenderovich et al., 2020;World Health Organisation [WHO], 2020).As the evidence supporting parenting programmes such as PLH is positive and substantial, studying their implementation fidelity represents an important way to explore which programme elements are correlated with outcomes and how to improve delivery.

| Measures of competent adherence
Many parenting programmes have developed programme-specific tools to measure implementation fidelity (e.g., Martin, Steele, et al., 2021).This paper focuses on a measure developed for and used in PLH-YC to assess two aspects of implementation fidelitycompetence and adherence, which together capture the skill and diligence with which facilitators deliver intervention components (Fixsen et al., 2005).Assessing competent adherence has numerous research and practical benefits, including providing an objective assessment of the extent to which a particular parenting programme is implemented as planned.These assessments allow researchers to examine whether facilitator delivery is associated with parent and/or child outcomes, understand what makes an effective facilitator, and indicate where to target programme improvements (Forgatch et al., 2005).

Key messages
• The Facilitator Assessment Tool (FAT) is an observational implementation measure used to assess the competent adherence with which programme facilitators deliver the Parenting for Lifelong Health-Young Children (PLH-YC) programme.
• As PLH-YC is being delivered widely in many countries, it is necessary to establish the tool's psychometric properties to provide confidence that the competent adherence of facilitators is consistently and accurately assessed.
• This study analysed the content validity, intra-rater reliability, and inter-rater reliability to provide initial evidence of the tool's ability to capture its intended constructs and be used consistently by and between assessors.
• Further analyses of the FAT's reliability and validity (internal consistency, construct validity, and predictive validity) are warranted and would provide additional evidence as to the strength of this tool for measuring facilitator competent adherence.
A systematic review by Martin, Steele, et al. (2021) identified 65 measures of competent adherence employed in 63 parenting programmes.Parenting programmes differ according to content and delivery, and this variety is reflected in the number of fidelity tools available.Measures also vary in other respects: Some use observational, nonobservational, or a combination of methods; some are structured with dichotomous or Likert scale items; and all capture one or more aspects of competent adherence.

| PLH facilitator training and assessment
Facilitators-typically community members and professionals including teachers, psychologists, and social workers-receive PLH-YC training through a 5-day workshop (approximately 30 h).Training is provided by Clowns Without Borders South Africa (CWBSA), a nonprofit serving as a capacity building agency for the dissemination of PLH.In training, facilitators learn to implement programme activities and skills, including using group discussions on parenting challenges, illustrated comics to identify and model parenting skills, and group activities to practice skills (Lachman, Cluver, et al., 2016).Following training, facilitators deliver PLH-YC and receive regular supervision from coaches.
The PLH-YC Facilitator Assessment Tool (FAT) was developed by study investigators and programme developers to assess the quality with which facilitators deliver programme-specific activities and techniques.

| PLH facilitator assessment procedure
The FAT assessment procedure is similar to other measures of facilitator quality of delivery, including measures used by Parent Management Training-Oregon Model (PMTO) (Holtrop et al., 2020;Knutson et al., 2019) and Incredible Years (IY) (Eames et al., 2008).Like the PMTO and IY tools, the FAT uses observational methods (live or video-recorded) to assess facilitator delivery of entire programme sessions.Observational methods are used since these are considered more objective than nonobservational methods such as facilitator selfreports, which could be prone to social desirability bias (Eames et al., 2008;Stone et al., 1999).However, observational assessments are often more resource-intensive and may be susceptible to reactivity bias (Girard & Cohn, 2016).Training of FAT assessors is approximately 14 h and includes theoretical and practical components following a training curriculum, assessor manual, and coding matrix.As the training is conducted in low-income settings, fewer financial resources are dedicated to training assessors than in high-income settings.

| FAT
The FAT is composed of two subscales (50 items).The Activities Subscale assesses facilitator adherence to three key PLH-YC activities (24 items)-home activity discussion (10 items; e.g., "identify specific challenges when shared by at least one parent"), illustrated story discussion (seven items; e.g., "discuss possible solutions for negative stories"), and group practice activity (seven items; e.g., "debrief with the participants about experiences and feelings").The Skills Subscale assesses facilitator competence in delivering key PLH-YC process skills (26 items)-modelling parenting behaviours (seven items, e.g., "give positive, specific, and realistic instructions"), demonstrating collaborative facilitation (seven items; e.g., "accept participant responses verbally by reflecting back what the participant says"), encouraging participation (seven items; e.g., "participants appear comfortable and involved in the session"), and utilising leadership skills (six items, e.g., "use open-ended questions during group discussions").
Items are rated using a 4-point Likert scale (0 = inadequate, 1 = needs improvement, 2 = good, 3 = excellent).Items are summed to produce an impression score represented as a percentage.

| Current study
Although there is growing evidence of the effectiveness of PLH-YC, there are gaps in understanding regarding its implementation quality and, more specifically, concerning facilitator competent adherence.As there was not a suitable measure of competent adherence available in the literature, this study describes the development of the tool used to assess PLH-YC facilitators and provides preliminary psychometric evidence on the tool using data from the delivery of PLH-YC in North Macedonia, the Republic of Moldova ("Moldova"), and Romania as part of the RISE study.RISE is a collaboration seeking to evaluate the effectiveness and costs of PLH-YC (Frantz et al., 2019;Lachman et al., 2019).As a result, this study aimed to answer the following research question: what is the content validity, intra-rater reliability, and inter-rater reliability of the FAT as a measure of PLH-YC facilitator competent adherence in Southeastern Europe?As the first study on the FAT's psychometric properties, these indices were selected to examine whether the tool shows promise for further psychometric analyses or needs revision.

| METHODS
This paper examined the content validity, intra-rater reliability, and inter-rater reliability of the FAT in Southeastern Europe using COS-MIN recommendations (Mokkink et al., 2010).Content validity was examined by consulting with three stakeholder groups and revising the FAT accordingly.Intra-and inter-rater reliability was examined by determining the degree to which assessors used the FAT consistently over time as well as with each other.

| Participants
This study involved three stakeholder groups.First, during the content validity process, we consulted with eight certified trainers from CWBSA with at least 2 years of experience conducting FAT assessments in numerous countries.Second, we sought the advice of three parenting programme experts.These experts were consulted due to their extensive knowledge of PLH-YC and on conducting facilitator assessments in both research and practice.Third, those who conducted facilitator assessments ("assessors") supported our evaluation of all three psychometric properties.The assessors were 11 trained coaches involved in RISE (n = 5 in Moldova, n = 3 in North Macedonia, and n = 3 in Romania).The Moldovan assessors had a range of experiences and backgrounds supporting vulnerable families, including teaching and family therapy.The North Macedonian assessors were later career psychologists with experience in conducting assessments similar to the FAT.The Romanian assessors were early career psychologists based at a local university.The Moldovan and Romanian assessors also had prior experience as PLH-YC facilitators.

| Content validity
The FAT's content validity-the extent the tool appears to capture the intended constructs-was examined by gathering and synthesising the perspectives of the three aforementioned stakeholder groups on the measure's comprehensiveness, relevance, and comprehensibility (Mokkink et al., 2010;Terwee et al., 2018).First, we held a content validity workshop with CWBSA trainers, wherein detailed field notes were taken.The trainers were asked to describe their use of the FAT; share their perspectives on the tool's comprehensiveness, comprehensibility, and relevance; and suggest revisions to improve the tool's util-

| Intra-rater reliability
Intra-rater reliability, or assessor consistency, was examined by having each assessor observe a video recording of a facilitator delivering the programme twice with assessments conducted 3 weeks apart (Gwet, 2014;Heinl et al., 2016).As a result, a video from one facilitator per country was selected.It was only possible to assess one facilitator per country due to time and resource constraints.FAT scores were compared by calculating percentage agreements and intra-class correlations (ICCs) for each assessor and subscale (Margolin et al., 1998).Percentage agreements were selected because they indicate the ratio of instances wherein assessors chose the same rating.
Agreement levels above 70% were acceptable (Aspland & Gardner, 2003).ICCs were also examined to take chance agreement and correlation into account (Bruton et al., 2000;Koo & Li, 2016).A two-way mixed-effects model with an absolute agreement definition and single-rater type was used (McGraw & Wong, 1996;Shrout & Fleiss, 1979).ICCs were interpreted within a 95% confidence interval where ICCs under 0.50 were considered poor, between 0.50 and 0.75 were moderate, between 0.75 and 0.90 were good, and above 0.90 were excellent (Koo & Li, 2016).Percentage agreements and ICCs were calculated using the "irr package" in R (Gamer et al., 2017).As all items of the FAT should have been completed, mean imputation was used to take missing data into account when less than 10% of the data was missing (Watkins, 2018).

| Inter-rater reliability
Inter-rater reliability, the degree to which different coders similarly assess facilitator delivery (Chen & Krauss, 2004;Cho, 1981), was examined by having assessors observe video recordings of the same three facilitators selected out of 31 possible facilitators in Moldova, 16 in North Macedonia, and 32 in Romania (Hallgren, 2012).Thus, videos from a total of nine facilitators were used.It was only possible to assess three facilitators per country due to time and resource constraints.The data were analysed using the same methods as the assessments of intra-rater reliability.

| Content validity
The content validity consultations produced recommendations to improve the FAT.CWBSA trainers made four recommendations: break up complex items into separate, simple items; use specific definitions for each item and Likert point; add items to capture missing activities and skills; and remove redundant items.
After revisions based on trainer feedback, the FAT was shared with the parenting programme experts.Rooted in evidence linking praise and reflexive statements to participant outcomes (Eames et al., 2010), the experts recommended adding three items to measure the frequency of specific and unspecific praise (i.e., expressing approval and appreciation) and reflexive statements (i.e., reiterating participant contributions).
The further revised tool was then shared with PLH-YC coaches during assessment training.Initially, assessors recommended changes to item wording and suggested examples to include in the definitions.
After using the tool, the assessors provided further insight, indicating which items were difficult to understand and how they could be improved.This process helped ensure clear differences between points on the Likert scale.
Following assessor input, the revised FAT was finalised, resulting in 62 items with 26 items in the Activities Subscale, 33 items in the Skills Subscale, and three items in the Frequency Subscale.A summary of the recommendations and changes made is provided in Table 1.

| Intra-and inter-rater reliability
In terms of intra-rater reliability, the overall percentage agreements across the three countries ranged from 57.6% to 91.5% with ICCs of 0.52-0.94(Tables 2 and 3).Subscale percentage agreements ranged from 46.2% to 92.3% with ICCs of 0.32-0.96on the Activities Subscale, 45.5% to 90.9% with ICCs of 0.13-0.93 on the Skills Subscale, and 0.0% to 100.0% on the Frequency Subscale with ICCs of À0.79-1.00.
At the country-level, overall percentage agreements ranged from 57.6% to 89.9% with ICCs of 0.52-0.94 in Moldova, 78.0% to 88.1% with ICCs of 0.82-0.90 in North Macedonia, and 66.1% to 91.5% with ICCs of 0.79-0.94 in Romania.There was no missing intra-rater reliability data.
In terms of inter-rater reliability, the overall percentage agreements ranged from 18.1% to 74.0% with ICCs of 0.49-0.91

CWBSA Trainers
They recommended that the item, "Did the facilitator accept participant responses verbally and physically?" be divided into four items to capture whether the facilitator demonstrated physical acceptance (e.g., nodding), verbal acceptance (e.g., "mhm"), openness (e.g., "Interesting suggestion!"), and use of a reflexive statement (e.g., "Am I understanding you to say that you will schedule daily time to play with your child?").
Use specific definitions for each item and Likert point

CWBSA Trainers
Add items to capture missing activities and skills CWBSA Trainers "Did the facilitator identify core building blocks connected to the story?" was added to the illustrated story items in the Activities Subscale.

Remove redundant items CWBSA Trainers
The trainers recommended deleting the item, "Did the facilitator provide frequent praise throughout the discussion?" since praise was already incorporated into many questions.Note: ICC = intra-class correlation, 95% CI = 95% confidence interval; model: two-way mixed effects, type: single rater, definition: absolute agreement; three assessors and three facilitator assessments were used.missing as it was not possible to assess some items (e.g., an opportunity did not arise for the facilitator to deliver a programme component).These items should have been scored as each FAT item captures an aspect of PLH-YC that should be delivered.

| DISCUSSION
This study analysed three psychometric properties of the FAT.The content validity process resulted in a revised FAT that was more understandable, specific, and practical due to stakeholder recommendations to the FAT's items and assessment procedure.The analysis of assessor intra-rater reliability found somewhat acceptable levels of consistency (overall percentage agreements ranged from 57.6% to 91.5% with ICCs of 0.52-0.94and assessor inter-rater reliability ranged from 18.1% to 74.0% with ICCs of 0.49-0.91).Although assessors did not always achieve consensus, the ICCs were, with few exceptions, larger than the percentage agreements and most exceeded the suggested cut-off of ICC > 0.50 for moderate levels of reliability (Stemler & Tsai, 2008).This finding suggests that assessors were largely consistent in their application of the measure and its items, yet it was still difficult for assessors to achieve consensus on many occasions.It was particularly difficult for assessors from all countries to achieve intra-and inter-rater reliability on the Frequency Subscale.
Negative ICCs were found in multiple instances, indicating more within variance than between.As a result, this subscale was not rated reliably and should be dropped or modified.

| Contextual factors
The intra-and inter-rater reliability results are encouraging in light of

| Limitations and strengths
In addition to challenges delivering the training, this study had limitations.First, the study used a small sample (Koo & Li, 2016), necessitating caution in interpreting study results.Second, due to the study's limited scope, we used a purposive selection of sessions for assessment instead of random selection.This may have resulted in the assessment of nonrepresentative facilitator delivery (Walton et al., 2017).Third, missing data may have compromised results.
Future revisions of the FAT should ensure that assessors complete all items using strategies such as highlighting the importance of completing all items in assessor training and establishing monitoring and evaluation processes wherein FAT forms are checked for completion.
Despite limitations, this study makes an important contribution to our understanding of the psychometric properties of an assessment tool widely used to assess facilitators of an evidence-based parenting programme in LMICs.While valuable in all contexts, the need for practical, reliable, and valid assessment tools in low-income contexts is particularly heightened.The study found sufficient intra-and interrater reliability despite challenges encountered during training.Note: ICC = intra-class correlation, 95% CI = 95% confidence interval; model: two-way mixed effects, type: single rater, definition: absolute agreement; three assessors and three facilitator assessments were used; missing: 24 items.
However, the findings suggest that more attention should be paid to how reliability can be strengthened and explore whether those with certain backgrounds and skills would be better suited to conducting reliable assessments.For instance, future assessor training could require assessors to achieve an acceptable minimum level of intraand inter-rater reliability prior to conducting FAT assessments in practice.Further, two common issues in studying observational measures were avoided.First, facilitator reactivity to assessment was minimised because all programme sessions were recorded, thereby decreasing the likelihood that facilitators performed differently for assessments (Kazdin, 1982).Second, assessors did not evaluate facilitators they supervised to ensure assessor independence from the results (Walton et al., 2017).

| CONCLUSION
The  What does the facilitator need to improve? Recommendations: ity and accuracy.Their feedback was used to modify the FAT and create an initial training curriculum and manual.Second, we consulted the three parenting programme experts, who provided feedback on an updated version of the FAT and training manual.Third, during assessor training, RISE study assessors provided input on the FAT's comprehensiveness and relevance.It is important to note that the assessor training was not delivered to completion in North Macedonia and Romania due to scheduling conflicts, which may have compromised assessment reliability.
Create the frequency subscaleParenting Programme ExpertsThree additional items were added to the FAT: "Please record the number of discrete times the facilitator used reflexive statements and praise (specific/unspecific) during first twenty minutes of the home activity discussion: (1) reflexive statements, (2) specific praise, and (3) unspecific praise."Changes to item wording RISE AssessorsThe item "Accepts parent responses physically" was changed to "Uses body language to show acceptance challenges encountered during assessor training and varying levels of assessor experience.The training was not delivered to completion in North Macedonia and Romania due to scheduling conflicts, resulting in approximately 80% of the training being delivered in each of these contexts.In North Macedonia and Romania, three and two assessors respectively missed several hours of training due to other organisational commitments scheduled at the same time.Aside from Romania, the training relied on translation into local languages, which may have led to an imprecise understanding of assessment procedures.Further, there may be different levels of reliability across countries due to differences in assessment experience.Results may have been stronger in North Macedonia as these assessors were experienced psychologists accustomed to completing assessments similar to the FAT.In contrast, the Moldovan assessors, where reliability was lower, were less experienced practitioners from non-psychology backgrounds.Thus, future training should take prior experience into consideration and provide additional support for those with less experience.Furthermore, future research should explore what assessor training, characteristics, skills, and ongoing supports are required for reliably conducting facilitator assessments.

CONNECT 8 .
Connect experiences to the building blocks from session 9. Connects individual experiences to universal principles PRACTICE 10.Identify opportunities to practice skills (in addition to the structured group practice) room in a way that encourages equal and active participation 2. Facilitator is situated within the group, is at the level of the parents, and in a different place than the co-facilitator 3. Parents appear engaged in session 4. Parents appear comfortable and satisfied 5. Parents share views and opinions 6. Parent-facilitator participation ratio 7. Assure equal and active participation among parents 8. Targeted engagement of quiet or non-participating parents 9. Limiting parent responses 10.Keep parents on topic of discussion 11.Demonstrate knowledge of session content 12. Level of self-confidence with session content 13.Help parents generate their own ideas regarding principles or solutions to challenges core activities = (A/B) Â 100% % Total percent score core skills = (C/D) Â 100% % What are the facilitator's strengths?
6-89.9% (mean: 75.3%) 46.2-92.3%(mean:74.6%)45.5-90.9%(mean:75.8%)0.0-100.0%(mean:60%)(Table4-6).The overall percentage agreement across the three assessments was 18.1% (activities: 29.5%, skills: 9.1%, frequency: 0.0%) in Moldova with an ICC of 0.62 (activities: 0.71, skills: 0.37, frequency: 0.52), 74% (activities: 71.8%, skills: 75.8%, frequency: 11.1%) in North Macedonia with an ICC of 0.91 (activities: 0.93, skills: 0.86, and frequency: 0.95), and 32.8% (activities: 38.7%, skills: 33.3%, frequency: 22.2%) in Romania with an ICC of 0.49 (activities: 0.65, skills: 0.50, frequency: 0.84).There was no missing inter-rater reliability data in Macedonia and a low amount of missing data in Moldova (0.11%)and Romania (4.30%).Assessors indicated that some data were T A B L E 3Note: ICC = intra-class correlation, 95% CI = 95% confidence interval; model: two-way mixed effects, type: single rater, definition: absolute agreement, five assessors and three facilitator assessments were used, missing: frequency data for assessment 1.T A B L E 5 North Macedonia inter-rater reliability data by assessment results of this study suggest that the Facilitator Assessment Tool appears to capture the competent adherence with which facilitators deliver Parenting for Lifelong Health for Young Children.Future research is necessary to strengthen the reliability of the measurement due to results suggesting sufficient though not high levels of intra- and inter-rater reliability.We also recommend further research on the FAT's psychometric properties.These include research on its internal consistency to determine whether Cronbach's alphas and omegas indicate that items are appropriately associated with each other, con-