Have you had bleeding from your gums? Self-report to identify giNGival inflammation (The SING diagnostic accuracy and diagnostic model development study)

Aim: To assess the diagnostic performance of self-reported oral health questions and develop a diagnostic model with additional risk factors to predict clinical gingival inflammation in systemically healthy adults in the United Kingdom. Methods: Gingival inflammation was measured by trained staff and defined as bleeding on probing (present if bleeding sites ≥ 30%). Sensitivity and specificity of self-reported questions were calculated; a diagnostic model to predict gingival inflammation


| INTRODUCTION
Periodontal disease is an inflammatory disease that affects the soft and hard tissues supporting teeth. This disease is largely preventable, yet it remains the major cause of poor oral health worldwide and is the primary cause of tooth loss in older adults (Petersen & Ogawa, 2000). Periodontal disease is classified into two broad categories: (1) gingivitis and (2) periodontitis. Gingivitis is usually characterized by, amongst other factors, gingival inflammation, which is a reversible condition identified by bleeding at the gingival margin. Gingival inflammation is an early sign of periodontal disease (Tonetti et al., 2020), and its association with periodontal health has been explored in numerous publications (Kallio, 1996; Weintraub et al., 2013). Identifying gingival inflammation could help prevent periodontal disease.
The development of self-reported tools to measure periodontal disease risk is particularly important in the field of oral health surveillance, since clinical measures are difficult and costly to collect, and hard or impossible to standardize. The Centers for Disease Control and Prevention (CDC) and the American Academy of Periodontology have recommended since 2003 the use of self-reported measurements that could be valid to predict the prevalence of periodontal disease as an alternative to examinations (Eke et al., 2012). Several self-reported oral health questions have been previously discussed as an alternative to clinical measures of periodontal disease (Blicher et al., 2015; Abbood et al., 2016), but studies focusing on gingivitis or gingival inflammation have been inconclusive due to small, non-representative samples and the use of single questions without the consideration of other potential risk factors (Abbood et al., 2016).
Prediction models, and specifically diagnostic models, allow the inclusion of several risk factors to predict the existence of a condition (Collins et al., 2015). Prediction models are currently used in different areas of medicine (Collins et al., 2015) and have become increasingly popular in assessing the prevalence and progression of periodontitis (Du et al., 2018). There are currently no models available to predict gingival inflammation. Diagnostic models of gingival inflammation could be used as a first-line assessment in a clinical surveillance system, identifying patients who need further assessment, or they could be incorporated in large epidemiological studies to target clinical examinations in research participants.
We aimed to assess the diagnostic accuracy of several self-reported oral health questions individually to identify gingival inflammation, measured by bleeding on probing, in a large and representative UK-based sample. We developed and validated a diagnostic model including other gingival inflammation risk factors to assess whether additional risk factors would improve the diagnostic accuracy of these measures. We hypothesized that using self-reported oral health questions with or without additional risk factors could help identify clinical gingival inflammation and, therefore, constitute an alternative to clinical measurement in research settings.

| Data
We collected and combined clinical and self-reported data from adults participating in two large, UK-wide dental randomized trials.
Combination of data from both trials was deemed reasonable to follow TRIPOD guidelines and maximize statistical precision (Collins et al., 2015) and because inclusion criteria, setting, recruitment and data collection processes, patient reported and clinical outcomes collected were identical in both trials. Moreover, the trials were conducted by the same team of researchers.

| IQuaD trial
The IQuaD trial compared the clinical and cost-effectiveness of providing scale and polish 6 monthly, 12 monthly or never during 3 years; the clinical and cost-effectiveness of personalized vs standard oral hygiene advice were also compared. The study recruited 1877 participants from Scotland and Northeast England. Recruitment began in 2011, and data were collected until September 2016.

| INTERVAL trial
The INTERVAL trial compared the clinical and cost-effectiveness of 24-monthly vs 6-monthly vs risk-based recalls on the same outcome and randomized 2372 participants from Scotland, England, Wales and Northern Ireland. Recruitment began in July 2010, and data were collected until August 2018.
Both trials were pre-registered and had bleeding on probing as their primary outcome. They were set up in primary dental care provided by the National Health Service in the United Kingdom.
Participants were recruited via their dental practices.
Participant-reported data, including the index tests, in IQuaD and INTERVAL were collected using an annual patient questionnaire from baseline until year 3 (for IQuaD) or 4 (for INTERVAL). The data used in our model are restricted to the final year follow-up of each trial; therefore, this is a cross-sectional analysis. IQuaD and INTERVAL collected similar information about the participants, but not all outcomes were collected in both trials.

Practical Implications: A self-reported bleeding gums question or a diagnostic model is helpful to identify people with gingival inflammation and should be considered if it is unfeasible or impossible to collect clinical data.
Clinical data were collected at follow-up (3 or 4 years post-randomization for IQuaD and INTERVAL, respectively). Variables collected at the clinical assessment included the following: bleeding on probing, calculus and pocket depth. Clinical outcomes were measured at the end of each trial by trained outcome assessors. Gingival inflammation was measured according to the Gingival Index of Loe (1967) by running the UNC probe circumferentially around each tooth just within the gingival sulcus or pocket. After 30 s, bleeding was recorded as being present or absent on the buccal and lingual surfaces.
Percentage of sites bleeding on probing per participant was calculated by adding the sites where bleeding was present in each participant (two sites per tooth: buccal and lingual), dividing by twice the number of teeth in the mouth and multiplying by 100, thus generating a variable varying from 0 (no bleeding in any of the sites available) to 100 (bleeding in all sites available). More information about clinical outcome collection can be found in IQuaD's and INTERVAL's protocols (Clarkson et al., 2013; Clarkson et al., 2018). Participants did not have access to their clinical data before answering the questions and clinical examiners did not have access to self-reported data.
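The percentage calculation described above can be sketched in a few lines. This is an illustration only, not the trials' analysis code; the function and argument names (`percent_bleeding_sites`, `bleeding_sites`, `teeth_present`) are hypothetical.

```python
def percent_bleeding_sites(bleeding_sites: int, teeth_present: int) -> float:
    """Percentage of sites bleeding on probing for one participant.

    Two sites are scored per tooth (buccal and lingual), so the
    denominator is twice the number of teeth in the mouth.
    """
    if teeth_present <= 0:
        raise ValueError("participant must have at least one tooth")
    total_sites = 2 * teeth_present
    if not 0 <= bleeding_sites <= total_sites:
        raise ValueError("bleeding_sites must be between 0 and 2 * teeth_present")
    return 100.0 * bleeding_sites / total_sites
```

For example, a participant with 20 teeth and 12 bleeding sites has 12 of 40 sites bleeding, i.e. 30%.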

| Study population
Eligibility criteria were similar in both trials: both included adult participants (≥18 years of age) who were dentate, had attended a check-up at least twice in the 2 years prior to the trial and received their dental care in part or fully as a patient in the National Health Service in the United Kingdom. Patients with an uncontrolled medical condition (e.g. diabetes or an immunocompromised state) were excluded from both trials. IQuaD had an additional inclusion criterion to INTERVAL: participants had to score 0-3 in their Basic Periodontal Examination.
Participants were approached and recruited if they met the eligibility criteria and agreed to take part; therefore, we used a convenience sample.

| Gingival inflammation definition
According to the most recent classification from the 2017 Classification of Periodontal Diseases (Chapple et al., 2018; Trombelli et al., 2018), for adults with pocket depths ≤3 mm, and an intact and a reduced periodontium, generalized gingivitis for epidemiological purposes can be defined as 30% or more bleeding sites. Localized gingivitis is defined by having between 10 and 29% of bleeding sites. Participants with <10% sites bleeding are gingivitis-free. Because pocket depths could be higher than 3 mm and we did not measure clinical attachment level, we did not focus on gingivitis. We used 30% as the primary cut-off for our analyses and considered it to be an indicator of gingival inflammation. To explore the impact of lowering the threshold to define gingival inflammation in our models, secondary analyses used 10% sites bleeding as the cut-off.
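As a small illustration of the cut-offs above, the sketch below bands a bleeding percentage and applies the primary (30%) or secondary (10%) cut-off. The function names are hypothetical, and treating values between 29% and 30% as "localized" is an assumption made here, since the text only gives whole-number bands.

```python
def bleeding_band(percent_sites_bleeding: float) -> str:
    """Epidemiological band for a percentage of bleeding sites."""
    if not 0 <= percent_sites_bleeding <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percent_sites_bleeding >= 30:
        return "generalized"
    if percent_sites_bleeding >= 10:
        # Assumption: anything from 10% up to (but excluding) 30% is localized.
        return "localized"
    return "gingivitis-free"


def has_gingival_inflammation(percent_sites_bleeding: float,
                              cutoff: float = 30.0) -> bool:
    """Reference standard: True if bleeding sites reach the chosen cut-off.

    cutoff=30.0 mirrors the primary analysis; cutoff=10.0 mirrors the
    secondary analysis with the lowered threshold.
    """
    return percent_sites_bleeding >= cutoff
```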

| Outcome
Outcomes are the sensitivity and specificity of the index tests on detection of gingival inflammation (defined as having bleeding on probing equal to or higher than 30%). We explored the impact of lowering the threshold to define gingival inflammation (using 10% or greater sites bleeding as the definition) in a sensitivity analysis.

| Index tests
Four self-reported bleeding or bleeding-related measures were used as the index tests in the current study. Table 1 presents the measures and their score. In order to perform the diagnostic accuracy analysis for single self-reported gingival inflammation measures, Likert items were transformed into binary measures. To decide the appropriate threshold, we assessed the scales' diagnostic performance at different thresholds using receiver operating characteristic curves and selected the best performance. We assumed the best performance to be the one that had the overall best proportion of correctly classified individuals considering both sensitivity and specificity.

TABLE 1 Four self-reported gingivitis measures used and their scores and transformation for individual diagnostic analysis
Columns: Self-reported measures; Original score (used in the diagnostic model). Transformation given for two measures: a score of 2 or more was considered as a positive index test.
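The threshold selection for the Likert items can be sketched as follows. This is a minimal illustration under one stated assumption: "best performance" is read here as the highest proportion of correctly classified participants, with ties going to the lowest threshold. The function name and the toy data are hypothetical.

```python
def best_likert_threshold(scores, reference_positive, thresholds):
    """Pick the Likert cut-off with the highest proportion correctly classified.

    scores: Likert responses (higher = more frequent bleeding, by assumption).
    reference_positive: booleans from the clinical reference standard.
    thresholds: candidate cut-offs; a score >= cut-off counts as a positive test.
    Returns (threshold, proportion_correct); earlier thresholds win ties.
    """
    if len(scores) != len(reference_positive) or not scores:
        raise ValueError("scores and reference_positive must be non-empty and aligned")
    best = None
    for threshold in thresholds:
        index_positive = [score >= threshold for score in scores]
        correct = sum(i == r for i, r in zip(index_positive, reference_positive))
        proportion = correct / len(scores)
        if best is None or proportion > best[1]:
            best = (threshold, proportion)
    return best
```

With toy responses `[1, 1, 2, 3, 4, 2]` and reference results `[False, False, True, True, True, False]`, cut-offs of 2 and 3 both classify five of six participants correctly, so the lower cut-off (2) is returned.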

| Statistical analysis methods
Diagnostic measures (sensitivity and specificity) were calculated for each self-reported index test measure with 95% confidence intervals calculated using the Agresti-Coull method (Newcombe, 1998).
We used a complete case approach where diagnostic measures were calculated if participants had information in both the index test and reference standard (clinical gingival inflammation).
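A minimal sketch of the complete-case sensitivity and specificity calculation with Agresti-Coull intervals is given below. It is not the authors' code; the function names and the use of `None` to mark missing values are assumptions for illustration.

```python
import math


def agresti_coull(successes: int, n: int, z: float = 1.959963984540054):
    """Agresti-Coull confidence interval for a proportion (95% by default)."""
    if n <= 0:
        raise ValueError("n must be positive")
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


def diagnostic_accuracy(pairs):
    """Sensitivity and specificity from (index_test, reference) pairs.

    Each element is True, False or None (missing). Only participants with
    both values present are used (complete-case approach).
    """
    complete = [(i, r) for i, r in pairs if i is not None and r is not None]
    tp = sum(1 for i, r in complete if i and r)
    fn = sum(1 for i, r in complete if not i and r)
    tn = sum(1 for i, r in complete if not i and not r)
    fp = sum(1 for i, r in complete if i and not r)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one reference-positive and one reference-negative case")
    return {
        "n": len(complete),
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": agresti_coull(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": agresti_coull(tn, tn + fp),
    }
```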

| Outcome
The main outcome of interest was generalized gingival inflammation defined as 30% or more sites with bleeding on probing. We explored the impact of lowering the threshold to define gingival inflammation (using 10% or greater sites bleeding as the definition) in a sensitivity analysis.

| Candidate predictors
All variables collected in IQuaD and INTERVAL were considered for inclusion in the model (Table S1; Table 1). Randomized treatment was not included in the model given the cross-sectional nature of the analysis, and the fact that any treatment was only provided to patients after the clinical outcomes were measured.

| Development of a prediction model to identify gingival inflammation
The univariable (unadjusted) associations between each candidate predictor and gingival inflammation were assessed using a logistic regression model, to assess the impact of each candidate predictor individually in relation to gingival inflammation. We assessed linearity between predictors and the log odds of gingival inflammation and found all relationships to be linear. A p-value of <.05 identified statistically significant univariate associations.
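To make the univariable step concrete, the sketch below fits a logistic regression with a single predictor by Newton-Raphson and reports the odds ratio with a Wald p-value. It is a self-contained illustration, not the software or code used in the study; the function names are hypothetical, and it assumes a numeric predictor and data that are not perfectly separated.

```python
import math


def _score_and_information(b0, b1, x, y):
    """Gradient (score) and information matrix entries at (b0, b1)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11


def fit_univariable_logistic(x, y, max_iter=50, tol=1e-10):
    """Unadjusted logistic regression of a 0/1 outcome y on one predictor x."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0, g1, h00, h01, h11 = _score_and_information(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = score.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    # Standard error of the slope from the inverse information at the estimate.
    _, _, h00, h01, h11 = _score_and_information(b0, b1, x, y)
    se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    z = b1 / se_b1
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided Wald test
    return {"log_odds": b1, "odds_ratio": math.exp(b1),
            "se_log_odds": se_b1, "p_value": p_value}
```

A predictor would be flagged as significantly associated at the univariable stage when the returned `p_value` is below .05.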

| Missing data
We assumed a missing at random mechanism and used multiple imputation as a sensitivity analysis to impute missing data. We excluded predictors from the multiple logistic regression complete case main model that had more than 10% missing data.
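The 10% missingness rule for the complete-case main model can be sketched as a simple filter. The record layout (a list of dictionaries with `None` for missing values) and the predictor names in the example are hypothetical.

```python
def predictors_within_missing_limit(records, predictors, max_missing=0.10):
    """Keep predictors whose share of missing values is at most max_missing.

    records: list of dicts, one per participant; a missing value is None
    (or the key is absent).
    """
    if not records:
        raise ValueError("records must not be empty")
    kept = []
    for name in predictors:
        missing = sum(1 for row in records if row.get(name) is None)
        if missing / len(records) <= max_missing:
            kept.append(name)
    return kept
```

For instance, with ten participants, a predictor missing for one of them (10%) is retained, while one missing for two of them (20%) is dropped from the complete-case model.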

| Performance measures
Performance was evaluated using the full dataset following the appropriate recommendations (Collins et al., 2015). We quan- from participants, and more information on the process has been published elsewhere (Clarkson et al., 2018).

| RESULTS
We included a maximum of 2853 participants that provided a measure of clinical bleeding and at least one of the index tests (self-reported questions related to gingival inflammation; Figure 1).

| Diagnostic performance of the self-reported oral health questions
Figures S1-S3 show the ROC curves for each Likert scale question before transformation. The diagnostic performance for the four index tests is given in Table 3 using the main reference standard cut-off of 30%. The binary self-reported bleeding gums question had the highest specificity (0.89) but the lowest sensitivity (0.20).
The Likert scale self-reported bleeding gums question had the highest sensitivity (0.73) but poor specificity (0.39). Self-reported unpleasant taste in mouth and bad breath performed poorly.

| Diagnostic modelling results
All predictors available were selected at least once by the experts invited and therefore included in the full model. In the final model (Table 4)

| DISCUSSION
To our knowledge, this is the largest study evaluating the diagnostic performance of self-reported oral health questions to detect gingivitis. We have developed the first diagnostic tool for gingival inflammation using a large UK-wide sample, combining self-reported oral health questions with recognized gingival inflammation risk factors. In this process, we have used the most up-to-date recommendations and reporting guidance (Collins et al., 2015; Cohen et al., 2016) and included expert discussion to identify potentially relevant risk factors.

FIGURE 1 Flowchart of the cases available to be analysed in the diagnostic modelling (SR, self-reported; bin, binary; lik, Likert)

Self-reported bleeding questions are preferable to questions focusing on gingival inflammation side-effects: they yielded better diagnostic performance and had lower rates of missing data, suggesting they are more acceptable to patients. These findings are in line with previous studies from Gilbert and Nuttall (Gilbert & Nuttall, 1999) and Dietrich (Dietrich et al., 2005). Even though halitosis has been shown to be associated with bleeding gums (Kayombo & Mumghamba, 2017), it is possible that the indirect nature of the questions or their potential negative connotation is unhelpful.
A single self-reported bleeding question (Likert item) had comparable diagnostic accuracy to the diagnostic model we developed with four predictors. The diagnostic model had marginally better sensitivity and better specificity than the self-reported bleeding Likert item, resulting in a higher, moderate discriminant ability (model's C-statistic = 0.65 vs Likert item area under the curve (AUC) = 0.60). Adding clinical predictors had a limited impact on the model's diagnostic performance. In practice, researchers who can access the risk factor information included in our model will benefit marginally from it compared with asking patients about their bleeding gums with a single Likert item, but this benefit is unlikely to be of clinical importance.
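For readers less familiar with the C-statistic, the sketch below computes it as the proportion of (inflamed, not inflamed) participant pairs in which the inflamed participant has the higher score, counting ties as half. For a binary outcome this equals the area under the ROC curve. The function name is hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def c_statistic(scores, has_condition):
    """Concordance (C) statistic of a score against a binary reference.

    scores: model predictions or Likert responses (higher = more likely positive).
    has_condition: booleans from the reference standard.
    """
    positives = [s for s, c in zip(scores, has_condition) if c]
    negatives = [s for s, c in zip(scores, has_condition) if not c]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                concordant += 1.0
            elif pos == neg:
                concordant += 0.5  # ties count as half-concordant
    return concordant / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination, which places the reported values of 0.60 and 0.65 toward the lower end of that range.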
Smoking status, oral health behaviour, previous pattern of having scale and polishes and self-reported bleeding were selected as predictors in the final diagnostic model. The association between smoking and periodontal health is well established (Abbood et al., 2016;Du et al., 2018), and previous periodontitis prediction models had found oral health behaviour and dental visits as significant predictors (Du et al., 2018).
The original studies (IQuaD and INTERVAL) were not specifically designed as diagnostic accuracy or diagnostic model development studies, although we followed recommended practice where possible, including blinding assessors to index test results.
Candidate predictors available were limited to those collected in IQuaD and INTERVAL and probably excluded important predictors that could have improved the diagnostic accuracy of the prediction model. SING's outcome, gingival inflammation, is challenging for two reasons: its definition and its measurement.
We used the most recent gingivitis threshold classification (Chapple et al., 2018) as an indicator of gingival inflammation. IQuaD and INTERVAL included intensive measurement training of outcome assessors and examined two tooth surfaces per tooth (partial mouth examination). This is recognized to be highly desirable for both patients and oral health professionals, and particularly in the context of two large UK-wide trials, even though the extent to which the results would differ from the results of a full-mouth examination is unknown (Trombelli et al., 2018). Bleeding on probing is impossible to calibrate. This may result in measurement error and misclassification of inflammation status, which can lead to biased risk scores and C-statistics (Zawistowski et al., 2017).
Because the extent of misclassification is unknown, we could not address it in our analysis, which is a common problem in this field (Kuchenhoff et al., 2009).
The diagnostic performance results found in SING are similar to other periodontitis prediction models (Du et al., 2018) and show challenges across the periodontal health field in identifying prediction models that are clinically useful. This could be due to a non-comprehensive identification of candidate predictors, but it is also related to heterogeneity in definition and measurement of disease and inflammation and absence of external validation of models.

Number of teeth 23.6 (4.7), 2853

There is room to improve the sensitivity of the diagnostic model developed here. Researchers have suggested other potential predictors to improve sensitivity of prediction models in this field: gingival crevicular fluid, salivary markers or microbiology information have been identified as useful in periodontitis models (Du et al., 2018); systemic factors such as metabolic or nutritional factors and haematologic conditions can affect the extension, severity and progression of gingivitis and gingival inflammation (Chapple et al., 2018).
However, it is important to find a balance between discrimination ability and cost of a diagnostic tool, as well as ease of implementation.

The results reported in SING need to be replicated in different settings (when it comes to the diagnostic accuracy study) and externally validated (when it comes to the diagnostic model developed). SING included a large and representative sample of regular attenders to the dentist in the UK NHS, but it excluded participants with serious oral health disease and its results might not be generalizable to that population. On the one hand, participants attending the dentist regularly might be more aware of their oral health condition, but on the other hand participants suffering from more serious disease might be better at self-reporting it (Blicher et al., 2015).
We showed that a simple self-reported bleeding gums question yielded similar, though lower, diagnostic accuracy compared with a diagnostic model including clinical and patient factors. Its sensitivity is good (0.73) but its specificity is low (0.39), so it needs to be improved before self-reported measures can be considered as a replacement for clinical ones. Including other risk factors improved diagnostic accuracy, but not in a clinically significant way. There are two possible ways forward: (1) the improvement of self-reported bleeding gums questions by involving patients in their development and refinement and by collecting information closer to real time using health apps, which have recently shown promising results in identifying gingival inflammation using self-reported bleeding (Tonetti et al., 2020); and (2) the improvement of diagnostic accuracy in future diagnostic models to identify gingival inflammation by involving appropriate stakeholders in the identification of candidate predictors. As it currently stands, the self-reported bleeding gums Likert item or the model is helpful to identify people with gingival inflammation, and they should be considered if it is unfeasible or impossible to collect clinical data, for example in multiple imputation models if patients have missing clinical data.

ACKNOWLEDGEMENTS
We would like to acknowledge all study participants in IQuaD and INTERVAL for contributing with their information, and the trial teams involved in planning, conducting and delivering both trials.

CONFLICT OF INTEREST
The authors have stated explicitly that there are no conflicts of interest in connection with this article.

AUTHOR CONTRIBUTIONS
BG, GM and CR designed the study. CR co-led the data collection process. BG analysed the data. All authors were involved in the interpretation of results. BG drafted the manuscript. GM and CR commented on the manuscript. All authors agreed to the final version of the manuscript.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.