Reliability of pediatric Rome IV criteria for the diagnosis of disorders of gut–brain interaction

The diagnosis of disorders of gut–brain interaction (DGBI) in children is based exclusively on clinical criteria known as the Rome criteria. Inter-rater reliability (IRR) measures how well two raters agree on a diagnosis when using the same diagnostic tool. Previous versions of the Rome criteria showed only fair to moderate IRR, and no studies have assessed the IRR of the current edition of the pediatric Rome criteria (Rome IV). This study sought to investigate the IRR of the pediatric Rome IV criteria and compare its reliability with that of previous versions of the Rome criteria. We hypothesized that the changes made in Rome IV would result in higher IRR than previous versions.


| INTRODUCTION
Disorders of gut-brain interaction (DGBI) are a group of disorders that result from a combination of disturbances in gastrointestinal motility and visceral sensitivity, alteration of immunological and mucosal function, dysbiosis, as well as changes in the perception and processing of gastrointestinal tract signaling by the central nervous system.1 For the past 30 years, the Rome criteria have served as a validated tool for the diagnosis of DGBIs in children. The criteria are periodically updated based on newly published data and expert opinion; the most recent version, Rome IV, was published in 2016. There are currently no biological markers to facilitate the diagnosis of DGBIs. In the absence of objective markers, the diagnosis is exclusively based on the Rome criteria. Consequently, it is critical that the applied diagnostic criteria be reliable. Reliability ensures consistency in the diagnostic process, facilitates communication and common understanding among healthcare professionals, and enhances the decision process for each patient's needs. Additionally, reliability plays a pivotal role in generating accurate and comparable data across studies and is a necessary step in the process of validating a diagnostic tool. Thus, poor reliability hinders clinical practice and research.
Two studies have been published to date on the reliability of the Rome II and Rome III criteria, both by our group.2,3 In 2005, we performed a study on 10 pediatric gastroenterology specialists and 10 pediatric gastroenterology fellows, who were provided with 20 clinical vignettes covering nine different intended diagnoses and 17 possible diagnostic options to choose from. The study assessed the reliability of the Rome criteria using the inter-rater reliability (IRR), a measure of the extent to which different observers are consistent in their diagnosis when using the criteria. The study found fair to moderate IRR for the Rome II criteria. In 2010, our group replicated the study to assess the reliability of the Rome III criteria. We again found fair to moderate IRR for the Rome III criteria, as well as discordance in agreement between fellows and pediatric gastroenterologists, and between gastroenterologists with and without expertise in DGBIs. Both studies indicated the need for further improvement of the Rome criteria to make them more comprehensible and operator-friendly.2,3
However, to date, there is limited data on the reliability of the Rome IV criteria in the 5-18 year-old subgroup, and no data on their reliability in the 0-4 year-old subgroup. Kaul et al.4 investigated the reliability of the Rome IV criteria by administering clinical vignettes (including children 6-17 years old) to 34 gastroenterology fellows and faculty members and requiring them to identify the most likely Rome IV diagnosis; the study found a 68% inter-rater agreement. While the study by Kaul et al.4 provides valuable insights into the reliability of the Rome IV criteria, we designed a study that would allow us to assess the evolution of the reliability across different iterations of the Rome criteria. Our study sought to evaluate the IRR of the Rome IV diagnostic criteria using the same clinical vignettes as the previously published studies on Rome II and Rome III. The primary aim was to evaluate the diagnostic IRR while using the Rome IV criteria in children. The secondary aim was to compare the IRR of Rome IV with previously published studies on the Rome II and III criteria. We hypothesized that the IRR of the Rome IV criteria would be higher than the IRR of the Rome II and III criteria.

| METHODS
To compare the reliability of the Rome II, III, and IV criteria, we utilized the same methodology (including the same clinical vignettes, number of participants, and levels of expertise) as in the previous studies on the reliability of the Rome II and Rome III criteria.2,3 To test our hypothesis, we invited 10 pediatric gastroenterology fellows and 10 board-certified pediatric gastroenterologists to participate in the study. The group of 10 board-certified pediatric gastroenterologists comprised five experts in neurogastroenterology and five specialists with expertise in other groups of GI conditions. Expertise in neurogastroenterology was defined as a physician having at least 15 publications on functional gastrointestinal disorders. Each participant was given 20 vignettes that covered nine diagnoses and a list of 17 possible response options to choose from. The participants were also given a copy of the Rome IV criteria to facilitate their diagnostic assessments. To account for additional Rome IV diagnoses that were not included in the previous Rome reliability studies, or for vignettes in which the rater considered that none of the listed diagnoses applied, we included the options "none of the above" and "not enough information." Each rater was instructed to select one diagnosis per vignette. The vignettes were identical for all raters and were based on randomly selected cases from real pediatric patients followed in the DGBI clinic. The study was approved by the institutional review board of the University of Miami.

| Statistical analysis
The responses were evaluated for IRR using the percentage of agreement. The percentage of agreement was calculated by dividing the total number of clinicians' congruent diagnoses for individual cases by the total attainable number of scores. To account for possible random agreement, we also calculated Cohen's kappa coefficient, a measure of pairwise agreement corrected for chance. This coefficient can range from −1 to +1, where +1 represents perfect agreement among the raters and 0 represents the agreement expected from random chance. The kappa result was interpreted as follows: values ≤0 as no agreement, 0.01-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1.00 as almost perfect agreement.5 To account for the probability of error involved in accepting the kappa value, we calculated the p value using the method described by Siegel and Castellan.6

Key points
• The Rome IV criteria revisions have increased inter-rater reliability in diagnosing disorders of gut-brain interaction (DGBIs).
• Although diagnostic consistency has improved with the Rome IV criteria compared to previous versions, reliability remains moderate, highlighting the need for ongoing refinement to reach higher agreement levels.
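As a concrete illustration of the statistic described above, the following sketch computes Cohen's kappa for a pair of raters and maps the result onto the agreement bands used in this study. The function names and the toy diagnoses are illustrative only; this is not the code used in the study.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Pairwise Cohen's kappa: observed agreement corrected for the
    agreement expected by chance from each rater's label frequencies."""
    n = len(rater1)
    # Observed proportion of vignettes on which the two raters agree.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from the marginal frequency of each diagnosis.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[d] * c2[d] for d in c1.keys() | c2.keys()) / n ** 2
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def interpret(kappa):
    """Agreement bands used in the paper (values <= 0: no agreement)."""
    if kappa <= 0:
        return "no agreement"
    for cutoff, label in [(0.20, "slight"), (0.40, "fair"),
                          (0.60, "moderate"), (0.80, "substantial"),
                          (1.00, "almost perfect")]:
        if kappa <= cutoff:
            return label

# Toy example: two raters diagnosing four vignettes.
r1 = ["IBS", "IBS", "functional dyspepsia", "functional dyspepsia"]
r2 = ["IBS", "IBS", "functional dyspepsia", "IBS"]
k = cohen_kappa(r1, r2)  # observed 0.75, chance 0.50 -> kappa 0.5
print(round(k, 2), interpret(k))  # 0.5 moderate
```

Note that with more than two raters, as in this study, pairwise kappas are typically computed for every rater pair (or a multi-rater statistic such as Fleiss' kappa is used); the sketch shows only the pairwise building block.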

| RESULTS
All 10 gastroenterologists and 10 fellows who were invited to the study agreed to participate and completed all the vignettes. The mean years of experience among the pediatric gastroenterologists was 12 years (range 2-28 years). The pediatric gastroenterology fellows ranged from first to third year of training, with a mean of 2.7 years of fellowship. Our primary aim was to evaluate the diagnostic IRR of the Rome IV criteria in children. The average IRR among the raters was 55% (range 30-100%) for the pediatric gastroenterologists and 48.5% (range 30-90%) for the pediatric gastroenterology fellows (Table 1). The IRR per clinical case was ≥50% in 9 out of 20 (45%) vignettes for the gastroenterologists and 8 out of 20 (40%) for the fellows. Only one vignette achieved unanimous agreement among the gastroenterologists' group, whereas the fellows' group did not reach complete agreement on any vignette. The kappa coefficient was 0.54 for the specialists (p < 0.0001) and 0.47 for the fellows (p < 0.0001). The kappa coefficient for the five experts in neurogastroenterology was 0.53 (p < 0.0001). The five pediatric gastroenterologists without specific expertise in neurogastroenterology had a kappa coefficient of 0.52 (p < 0.0001). All of these values are considered moderate agreement.
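The per-vignette agreement figures above can be reproduced as the share of raters selecting the modal diagnosis. The sketch below follows one plausible reading of the paper's definition (congruent diagnoses divided by attainable scores); the rater answers are made up for illustration.

```python
from collections import Counter

def per_vignette_agreement(diagnoses):
    """Percentage of raters whose answer matches the most common
    (modal) diagnosis for a single vignette."""
    label, count = Counter(diagnoses).most_common(1)[0]
    return label, 100.0 * count / len(diagnoses)

# Hypothetical answers from 10 raters for one vignette.
answers = ["IBS"] * 6 + ["functional dyspepsia"] * 3 + ["not enough information"]
label, pct = per_vignette_agreement(answers)
print(label, pct)  # IBS 60.0
```

Under this reading, a vignette counts toward the "IRR per clinical case ≥50%" figure whenever the modal diagnosis captures at least half of the raters.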
Our secondary aim sought to compare the IRR of Rome IV with the two previously conducted studies that examined the IRR of the Rome II and III criteria. The prior studies reported kappa coefficients of 0.37 (p < 0.0001) (fair) and 0.45 (p < 0.0001) (moderate) for specialists, and 0.41 (p < 0.0001) (moderate) and 0.39 (p < 0.0001) (fair) for fellows, for the Rome II and Rome III criteria, respectively (Table 2).2,3 We found higher kappa coefficients for both the fellows and the pediatric gastroenterologist groups compared to the previous studies on the Rome II and Rome III criteria. The improvement in kappa was consistent among subspecialists with and without expertise in neurogastroenterology. The pediatric gastroenterologists with expertise in DGBIs demonstrated moderate agreement (k = 0.53, p < 0.0001) in Rome IV, whereas they demonstrated only fair agreement in Rome II (k = 0.37, p < 0.0001) and Rome III (k = 0.37, p < 0.0001). The pediatric gastroenterologists with expertise in other groups of GI conditions had moderate agreement in both Rome III (k = 0.53, p < 0.0001) and Rome IV (k = 0.52, p < 0.0001), whereas they had only fair agreement in Rome II (k = 0.38, p < 0.0001). This indicates an improvement in IRR when utilizing the Rome IV criteria, as evidenced by the increased kappa values obtained in our investigation.

| DISCUSSION
The Rome criteria are the byproduct of decades of review of the updated literature and international expert group discussions. The latest version of the Rome criteria (Rome IV) incorporated several key changes. Among them, Rome IV simplified the classification of disease processes and modified its wording to avoid misinterpretations of the need to rule out organic diseases. Unlike previous versions of the criteria, Rome IV explicitly states that the need for medical evaluation should be decided by the practitioner on a case-by-case basis. The new criteria also clarified the possible coexistence of FGIDs with other organic (such as inflammatory bowel disease) or nonorganic gastrointestinal disease processes. An effort was made to explain definitions clearly, to refine criteria thought to be unclear or imprecise, and to incorporate new diagnoses, such as functional nausea and functional vomiting, as well as subdiagnoses, as in the case of functional dyspepsia and IBS.
Our study suggests that the changes implemented in Rome IV had a positive effect, leading to improvements in IRR among subspecialists and fellows. The percentage of agreement found in this study was 55% for subspecialists and 48.5% for fellows. As shown in Table 2, the IRR of the Rome IV criteria is higher than that of the Rome II and Rome III criteria. Our study also found higher kappa coefficients, indicating an improvement in reliability even when corrected for the possibility of agreement by chance.
Moreover, the study showed that the proportion of clinical cases with IRR ≥50% has also increased from previous versions of the Rome criteria to Rome IV. As expected, specialists (k = 0.54) scored better than fellows (k = 0.47). This finding is consistent with the results of the Rome III criteria study but stands in contrast with the Rome II study, where fellows (k = 0.41) had higher kappa values than specialists (k = 0.37).
In the Rome II study, we contextualized the low to moderate reliability by comparing the findings with data from other reliability investigations. For instance, a study on the interobserver reliability of triage in the emergency room reported a kappa value of 0.3.7 Similarly, an investigation on the interobserver reliability of pediatric pulmonologists diagnosing asthma using NIH guidelines revealed a kappa value of 0.3.8 Even a study investigating the agreement among radiologists diagnosing breast cancer by mammography revealed a kappa value of 0.6, which is only moderate agreement.9 These studies underscore the challenges encountered in achieving consistent agreement. Although the present study demonstrated improved reliability, it is important to note that higher levels of reliability are attainable. A study on the agreement of diagnoses of psychiatric disorders using the DSM-IV achieved kappa coefficients above 0.84, which is considered almost perfect agreement.10 These findings highlight the need for continued efforts to improve the reliability of the Rome criteria. Interestingly, our study observed a higher degree of agreement among participants in diagnosing defecation disorders than abdominal pain disorders. This variation highlights the importance of considering disorder-specific nuances when evaluating IRR.

Specifically, particular attention to the criteria surrounding abdominal pain disorders may be warranted in future editions of the Rome criteria. This is the first study to evaluate the impact of the changes made to the pediatric Rome criteria on their diagnostic reliability and to compare their reliability with prior Rome studies, a crucial measure to establish their validity. The design of our study presents a robust framework for comparing the reliability of current and future versions of diagnostic criteria. A key strength of our study lies in its consistent and standardized methodology across the three Rome criteria reliability studies. This is particularly relevant for several reasons. The use of identical clinical vignettes across all studies suggests that differences in kappa values between Rome II, III, and IV can be attributed to changes in the criteria themselves, rather than to variations in the clinical vignettes. By maintaining a consistent number of participants and ensuring that their experience in the field is similar across studies, we have reduced the potential for variation in kappa values due to differences in the rater pool. Using the same methods across all studies further helps to control for potential confounding factors. These elements combined enhance the strength of our findings and lend weight to the argument that the observed differences in kappa values reflect genuine changes in the reliability of the Rome criteria, rather than extraneous factors or methodological inconsistencies. Another key strength of our study lies in the comprehensive nature of our assessment. We employed a wide range of answer choices, totaling 16, including "none of the above" and "not enough information," which introduces greater complexity and a more rigorous test of diagnostic agreement among participants. This approach reflects the varied diagnostic scenarios encountered in clinical practice. Additionally, by utilizing 20 clinical vignettes and including both the younger and older pediatric populations, we provided an extensive evaluation of the IRR of the Rome IV criteria across a broad spectrum of clinical scenarios. The detailed and complex nature of these vignettes, designed to mirror real-world clinical challenges, provides a robust framework for assessing IRR. These methodological choices contribute to a nuanced understanding of the Rome IV criteria's applicability and reliability in diverse and realistic clinical settings.
Our study has several limitations. To be consistent with the methodology of the prior studies, we included only 10 fellows and 10 specialists, which may not be reflective of the entire community of pediatric gastroenterology fellows and specialists.
Moreover, in order not to modify the methods of the previous studies, we did not include the newly incorporated diagnoses as diagnostic options. Interestingly, instead of this negatively influencing the IRR, the study showed an improvement in reliability. Another limitation was the lack of demographic data on the participants.
As such, it is unknown whether factors such as sex/gender could confound comparisons across studies. The difference in mean years of experience across the three studies may also limit the strength of our study. We also acknowledge that confidence intervals would have provided a valuable perspective in interpreting the precision and reliability of the kappa estimates, especially when comparing these values across different groups and studies.

| CONCLUSION
Our study demonstrated moderate IRR of the Rome IV criteria and improved reliability when compared to previous versions of the Rome criteria. Overall, the results of our study offer a promising outlook on the future of DGBI diagnosis, pointing toward the consistency of a diagnosis when using the Rome IV criteria. The improved reliability not only enhances the utility of Rome IV for accurate diagnosis in clinical practice, but also underscores the value of ongoing research and revisions to these criteria. As we move toward a better understanding of the entire spectrum of DGBIs, revisions to the Rome criteria will continue to play a crucial role in guiding clinicians toward accurate diagnoses and, ultimately, better patient care. Nevertheless, further work is required to refine the criteria, as reliability is still only moderate. Future editions of the Rome criteria should aim to further improve their psychometric properties and to facilitate their use. Future studies may also investigate why certain groups, particularly those with expertise in DGBIs, exhibited more pronounced changes following the transition to Rome IV.

TABLE 1 Comparison of the inter-rater percentage of agreement of pediatric gastroenterologists and fellows. Columns: Vignette | Intended diagnosis | Inter-rater percentage of agreement (pediatric gastroenterologists) | Most common diagnosis | Inter-rater percentage of agreement (fellows) | Most common diagnosis.

TABLE 2 Overall comparison of inter-rater reliability of the Rome II, III, and IV criteria.