Inclusive critical appraisal of qualitative and quantitative findings in evidence synthesis

A diversity of approaches exists for critically appraising qualitative and quantitative evidence, each emphasizing different aspects. These approaches lack clear processes for rating the overall quality of the evidence for aggregated findings that combine qualitative and quantitative evidence. We draw on a meta-aggregation of implementation and process evaluations to illustrate a method for critically appraising empirical findings generated from qualitative and quantitative studies. This method includes a rubric for standardizing assessments of the overall quality of evidence in an evidence synthesis or mixed-method systematic review. The method first assesses the credibility of each finding extracted from a study. These individual assessments then feed into an overall score for any synthesized finding generated from the meta-aggregation. We argue that this approach provides a balanced and inclusive method of critical appraisal: it first assesses individual findings, rather than studies, using flexible criteria applicable to a range of primary study methods, and then derives an overall assessment of synthesized findings.


Highlights
What is already known
• A variety of approaches exists for evaluating the strength of mixed evidence in syntheses.

What is new
• A framework, adapted from the h-index, that allows for inclusive appraisal of mixed evidence and flexibility based on synthesis goals.

Potential impact for researchers outside the authors' field
Researchers from any field can adapt the appraisal tool to evaluative criteria relevant to their methodology, field conventions, and synthesis objectives. The framework accounts for the dynamism of evolving criteria within and across fields.

| INTRODUCTION
As part of a meta-aggregation of process evaluations on juvenile drug treatment courts (JDTCs), we developed a method for critically appraising qualitative findings from qualitative, quantitative, and mixed-methods studies. Our approach can serve as a framework for developing a standardized assessment of qualitative findings generated from different methodological approaches. Yet we do not position our approach as a method to replace existing frameworks or assessments. Instead, we add to the diversity of approaches, calling attention to the methodology we created, which is an adaptation of the h-index calculation. We were less concerned with the orientation of the primary researcher and the focus of the primary study.
Our objective was to synthesize mixed evidence and develop an inclusive rubric for appraising it. The synthesis approach we adopted, meta-aggregation, focused on synthesizing the various barriers and facilitators to implementing a juvenile drug treatment court from qualitative and quantitative studies. We judged the strength of the evidence relative to our goals: a low or high appraisal simply meant that the evidence had low or high value within our framework. We sought to establish the credibility of individual study findings and of the synthesized findings across studies.9-11 Scholars warn that summarizing evidence across a variety of qualitative approaches may result in the loss of context. Furthermore, they note that the typical notions of reliability and validity commonly applied to quantitative research may be inappropriate standards for qualitative research, arguing that formulaic applications of such criteria generally do not lead to higher-level insights.12,13 There is also concern expressed by some that qualitative research and subsequent syntheses may appear unscientific or lack rigor.12 However, others acknowledge the importance of evaluating qualitative evidence in syntheses.7,8,12,14 Proponents of critical appraisals of qualitative research note a number of advantages, including highlighting the variation in study quality, guarding against distorted findings, enhancing the transferability of qualitative research,7,13,14 and improving the practical utility of findings.6 A number of existing tools include enumerated checklists (e.g., Cochrane's GRADE-CERQual, the critical appraisal skills program [CASP], COREQ, ENTREQ, the JBI tool, ETQS, etc.)
and emphasize some common design elements for qualitative research: qualitative approach, research design, methods, analysis, reporting, researcher role and position, ethics, theory, context, reflexivity, reliability/validity, and depth and breadth of work.1-4,9,10,15-17 Other tools incorporate traditional quantitative techniques such as weighting methods19,20 or calculating reliability.21 Some authors argue for the development of reporting guidelines for qualitative research that parallel those used for randomized controlled trials (RCTs).18,22-25 Additional approaches call for the development and use of criteria based on the research questions asked and the context of the review,2 where reviewers decide the most appropriate criteria, especially as the synthesis process unfolds.8,25 Thus, no single set of criteria is definitive for assessing the quality of qualitative research, as critical appraisal tools can only approximate quality, not guarantee it.3,4,15,26,27 In addition, the diversity of approaches hinders standardized appraisal, particularly for mixed evidence, especially given the context-dependent nature of qualitative research.8,25 Yet pragmatic approaches exist that encourage the use of broad guidelines to steer the development and use of criteria and checklists. Quality may be best understood as multidimensional, addressing fundamental issues common across diverse qualitative approaches. In other words, tools do not necessarily exist to impose a standard application across studies, but to facilitate the selection and use of appropriate criteria, given the uniqueness of individual studies (see the Cochrane Qualitative and Implementation Methods Group).4,12,13

| AN INCLUSIVE RUBRIC FOR CRITICAL APPRAISAL OF MIXED EVIDENCE
As part of a larger project designed to inform the next generation of practice guidelines for juvenile drug treatment courts (JDTCs), we undertook a systematic review on the implementation barriers and facilitators of JDTCs that included qualitative, quantitative, and mixed-methods studies. Search, screening, and study retrieval were limited to process or implementation evaluations. Outcome evaluations were eligible for inclusion, provided that these studies included a process or implementation component. The systematic search identified 59 studies. We used the meta-aggregation method4,5 to synthesize findings across studies. The first step of this process was extracting, from the eligible manuscripts, text-based statements of each identified finding relevant to our purpose. A finding could be based on a qualitative or quantitative empirical result (see Harden et al. for a discussion on integrating diverse methods at various phases of a review).28 This process resulted in the identification of 477 findings for our synthesis across the 59 eligible studies.29 We created two parallel credibility assessment tools (i.e., checklists), one for qualitative findings and one for quantitative findings (see Figures 1 and 2). The checklist for each assessment was used to grade the credibility of each finding relative to the following general principles: (1) adequacy of the supporting data (e.g., was there a sufficient sample size, number of interviews, time spent observing, etc., for the finding to be credible); (2) whether a finding was well supported by the data, irrespective of the qualitative or quantitative method used, and not based on the opinion of the author(s);6 and (3) whether the approach to analyzing the data, qualitative or quantitative, was appropriate to the task at hand. The specific implementation of these three principles as applied to qualitative and quantitative findings is shown in Figures 1 and 2.
The creation and implementation of our criteria also assume a quantitative lens, as this was the epistemological posture of the authors at the time of implementation. Our position should further elucidate the uses and benefits of our approach, but also its limitations. The key highlight is our effort to hold qualitative and quantitative evidence together on the same plane, as integrated, rather than separate and distinct.
Each item on the assessment tools was scored 1 if the response was "yes" and 0 if the response was "no." These scores were summed, producing a four-point grading scale that ranged from 0 to 3. Under this assessment, the higher the score, the stronger the evidence for a particular finding. As discussed below, these individual quality assessment scores fed into a grading system for synthesized (aggregated) findings across studies. While not explicitly noted in the rubric, our processes for synthesizing findings avoided grouping contradictory information. Our objective was to develop coherent and consistent synthesized findings. We created different synthesized statements for contradictory findings, as they represented distinct themes. In doing so, we captured the diversity in the nature of findings for a single topic. For example, an insight about parental involvement in youth treatment courts could be positive and negative. One theme focused on the importance of parental involvement in ensuring compliance and treatment adherence. The other theme noted how parents could be an obstacle to holding youth accountable and, in some cases, were in need of treatment themselves. These themes were two aspects of the broader topic of parental involvement: parents as partners in facilitating behavioral change and as barriers to treatment implementation. We rarely encountered findings on topics with themes that canceled each other out.
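The per-finding scoring just described can be sketched in a few lines. This is an illustrative sketch only; the function name and boolean arguments are hypothetical stand-ins for the three checklist principles, not the authors' actual instrument.

```python
# Hypothetical sketch of the per-finding credibility score: each of the
# three checklist principles (adequacy of the data, support by the data,
# appropriateness of the analysis) is answered yes (1) or no (0), and
# the answers are summed to give a 0-3 credibility rating.
def credibility_score(data_adequate: bool, well_supported: bool,
                      analysis_appropriate: bool) -> int:
    return int(data_adequate) + int(well_supported) + int(analysis_appropriate)

# e.g., a finding with adequate data that is well supported, but whose
# analysis method is not described, would receive a rating of 2
print(credibility_score(True, True, False))  # prints 2
```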
A challenge in developing any grading system is that simply averaging the ratings across the individual findings will generally penalize any synthesized finding with evidence from numerous studies, as the likelihood of low-quality evidence increases, independent of the validity of the synthesized finding. For example, a synthesized finding based on three high-credibility studies and three low-credibility studies, where no study has a finding counter to the synthesized finding, should not be rated as less credible than a synthesized finding based simply on three high-credibility studies. Thus, the grading method for synthesized findings involved selecting the highest credibility rating that had at least two findings from two studies, assuming no counter or contrary evidence related to that finding. Recall that the strength-of-evidence scale ranged from 0 to 3 for both qualitative and quantitative findings. We labeled these levels "questionable," "low," "moderate," and "high" for levels 0 to 3, respectively. The full grading system is shown in the following rubric:
• 0 = no two findings with ratings above 0, or contradictory findings
• 1 = at least two findings with a rating of 1 or higher; no contrary findings
• 2 = at least two findings with a rating of 2 or higher; no contrary findings
• 3 = at least two findings with a rating of 3; no contrary findings
Notice the "or higher" language in this rubric. The focus is on clearing a threshold, rather than on a simple summation of scores. Specifically, we took into consideration the highest rating connected to a synthesized finding that was also associated with at least two unique studies contributing to the overall synthesized finding. For example, looking at the ratings from Table 1, under the column "quality of evidence," the values associated with the evidence rating are three 0s, one 1, one 2, and one 3 (i.e., 3, 2, 1, 0, 0, 0). Using the method above, we calculated an overall quality-of-evidence rating for this mixed evidence of 2, or moderate strength of evidence, in support of the synthesized finding.
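The threshold rule just described can also be expressed algorithmically. The sketch below is a minimal illustration under stated assumptions, not the authors' software: the function name, and the representation of findings as (study, rating) pairs, are inventions for the example.

```python
# Illustrative sketch of the threshold rule for grading a synthesized
# finding. Each contributing finding is a (study_id, rating) pair with
# rating 0-3. The overall grade is the highest level (3, then 2, then 1)
# supported by at least two findings, drawn from at least two distinct
# studies, with ratings at or above that level; contrary evidence, or a
# failure to clear any threshold, yields 0 ("questionable").
LABELS = {0: "questionable", 1: "low", 2: "moderate", 3: "high"}

def grade_synthesized_finding(findings, has_contrary=False):
    if has_contrary:
        return 0
    for level in (3, 2, 1):
        qualifying = [(study, r) for study, r in findings if r >= level]
        if len(qualifying) >= 2 and len({study for study, _ in qualifying}) >= 2:
            return level
    return 0

# Worked example from the text: ratings 3, 2, 1, 0, 0, 0 across six studies
ratings = [("A", 3), ("B", 2), ("C", 1), ("D", 0), ("E", 0), ("F", 0)]
grade = grade_synthesized_finding(ratings)
print(grade, LABELS[grade])  # prints: 2 moderate
```

Note that simply averaging these six ratings would yield 1.0, illustrating how the threshold rule avoids penalizing a synthesized finding for accumulating additional low-rated evidence.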
This method allows for grading the quality of evidence for included findings from both qualitative and quantitative studies. Thus, the scoring focuses on the highest quality of evidence supporting a synthesized finding. The synthesized finding itself is informed by all of the findings from which it emerged in the thematic coding. In cases where mixed evidence (qualitative and quantitative) comprises a synthesized finding, this method attempts to evaluate all the ratings evenly, without omitting the inclusion and influence of lower ratings. The threshold rule and rubric are inclusive of lower-rated findings that may advance a meaningful finding, a deliberate allowance that facilitates the reflexive use of appraisal criteria, whereby the presuppositions of reviewers, and how grading can occur, are accounted for in the rubric itself.3,11 In practice, the grading process begins with reviewing the contribution of a unique finding, irrespective of quality rating, followed by the formal evaluation of the overall evidence when all findings are combined into a synthesized finding. Furthermore, other assessment criteria may be substituted, or existing appraisal tools (e.g., GRADE-CERQual, the JBI tool, CASP, ETQS, etc.)10 could be adopted. Yet, as long as an appraisal approach uses ratings and seeks to aggregate those ratings in a meaningful way per some category of information (e.g., fidelity, methods, analysis, etc.), the threshold and rubric rule can aid in establishing an overall quality of evidence score.

[Table 1, example row. Synthesized finding: "Accountability outside the JDTC can be translated to supporting caregivers, families, and family members to improve discipline, supervision, and compliance (e.g., financial issues, disagreement with court…)"; final credibility rating: 2.]

| CONCLUSION
The novelty in our approach is twofold: first, we treat and extract the statistical interpretations from quantitative studies as qualitative findings and critically appraise them alongside findings from qualitative studies using separate but parallel rating schemes. Second, we complement this assessment with an overarching rubric that combines the separate ratings of evidence into an overall quality of evidence score for synthesized findings. In this regard, we expand the way in which we can bring together research findings on the same topic from different research traditions, a simultaneous mixed-method approach to both synthesis and appraisal of evidence. Furthermore, our approach offers an initial method for balancing the need for flexible grading criteria for individual findings at one stage with the use of an inclusive rubric at another stage, such as an aggregation stage that can, if desired, standardize the process of reaching an overall rating of synthesized findings. In practice, the assessment of the quality of evidence for individual findings can change and include the reflexive integration of criteria3,25 to remain sensitive to the dynamism of qualitative research approaches and how findings are generated. Yet the rubric for appraising individual findings that are eventually aggregated remains relatively stable and focused on a few key universal issues.
The approach presented here emerged from an effort to produce a credible list of practice recommendations to guide the evidence-informed implementation of JDTCs. The appraisal, rubric, and threshold rule helped with the development of these implementation guidelines, which are now viable for quantitative evaluation to assess their generalizability. In this regard, our method of appraisal was crucial for assessing the quality and credibility of qualitative findings in the development of aggregated results that had an application to a larger quantitative project. However, a drawback of this assessment process was that, on average, qualitative studies in our review were ranked lower in terms of credibility, presumably reinforcing a false hierarchy between qualitative and quantitative research as well as critiques of critical appraisal, both of which can overshadow the depth and meaning of synthesized findings. This pattern may have been a manifestation of the included studies not reporting key aspects of their methods; standard reporting criteria are another, parallel need for evidence synthesis, although a less contested issue (see References [23,25]). For example, a common reporting weakness of the qualitative studies was the lack of any information regarding how the qualitative data were analyzed. These studies may have used rigorous qualitative methods (and we were agnostic about the particular methods used), but without a description of the methods used, the findings were coded as less credible.
While our approach adds to the diversity of existing critical appraisal tools for qualitative evidence synthesis,29 this addition also reflects debates regarding the use of critical appraisal for qualitative evidence in evidence synthesis, or mixed evidence generally. Specifically, a number of issues challenge innovation and consistency in critical appraisal development. First, consensus on evaluating qualitative research is unlikely, given epistemological concerns about how qualitative research generates knowledge. Second, there is a variety of qualitative approaches used to generate knowledge. Some acknowledgment exists of the need for some form of quality assessment and the use of reporting guidelines. Cohen and Crabtree observed that "qualitative research is not a unified field,"12 and any desire or search for a clear set of criteria for qualitative research must take this into account. While this reality makes reaching consensus on a standard set of appraisal criteria difficult, the lack of consensus may still present an opportunity for advancing appraisals in evidence synthesis involving mixed evidence. The inclusion and consistent use of critical appraisal tools may be realized with the development of critical appraisal criteria for the variety of different mixed-method approaches. The continued cataloging of current appraisal tools and evaluation of their depth and breadth of coverage may be a useful starting point.9,12,23,24,30,31

ORCID
Ajima Olaghere https://orcid.org/0000-0003-4547-187X
David B. Wilson https://orcid.org/0000-0001-6461-3904

FIGURE 1 Assessment tool for qualitative findings. [Colour figure can be viewed at wileyonlinelibrary.com]
FIGURE 2 Assessment tool for quantitative findings. [Colour figure can be viewed at wileyonlinelibrary.com]

TABLE 1 Example of individual and synthesized (aggregated) assessments of a synthesized finding.