External validation of a United Kingdom primary‐care Cushing's prediction tool in a population of referred Dutch dogs

Abstract

Background: A prediction tool was developed and internally validated to aid the diagnosis of Cushing's syndrome in dogs attending UK primary‐care practices. External validation is an important part of model validation to assess model performance when used in different populations.

Objectives: To assess the original prediction model's transportability, applicability, and diagnostic performance in a secondary‐care practice in the Netherlands.

Animals: Two hundred thirty client‐owned dogs.

Methods: Retrospective observational study. Medical records of dogs under investigation for Cushing's syndrome between 2011 and 2020 were reviewed. Dogs diagnosed with Cushing's syndrome by the attending internists and fulfilling ALIVE criteria were defined as cases, others as non‐cases. All dogs were scored using the aforementioned prediction tool. Dog characteristics and predictor‐outcome effects in development and validation data sets were compared to assess model transportability. Calibration and discrimination were examined to assess model performance.

Results: Eighty of 230 dogs were defined as cases. Significant differences in dog characteristics were found between UK primary‐care and Dutch secondary‐care populations. Not all predictors from the original model were confirmed to be significant predictors in the validation sample. The model systematically overestimated the probability of having Cushing's syndrome (a = −1.10, P < .001). Calibration slope was 1.35 and discrimination proved excellent (area under the receiver operating characteristic curve = 0.83).

Conclusions and Clinical Importance: The prediction model had moderate transportability, excellent discriminatory ability, and overall overestimated the probability of having Cushing's syndrome. This study confirms its utility, though it emphasizes that ongoing validation of disease prediction tools is worthwhile.


1 | INTRODUCTION
Cushing's syndrome is an umbrella term for a range of clinical syndromes caused by a chronic excess of glucocorticoid activity, which can result from a range of endogenous or exogenous steroid hormones.1 It is a common endocrine disorder in dogs.[3][4] Spontaneous Cushing's syndrome is caused by an excessive production of glucocorticoids, often leading to a typical case presentation. Common clinical signs include polydipsia, polyuria, polyphagia, panting, abdominal enlargement, hepatomegaly, dermatological changes, and muscle atrophy. Frequently observed clinicopathological abnormalities include a stress leukogram, increased serum alkaline phosphatase (ALP) activity, and hypercholesterolemia.5,6 Multiple adrenal function tests and differentiating tests are described for diagnostic purposes, including urine corticoid-to-creatinine ratio (UCCR) with or without suppression using oral dexamethasone, low dose dexamethasone suppression test (LDDST) on the basis of blood cortisol, and adrenocorticotropic hormone (ACTH) stimulation test, also on the basis of blood cortisol measurement.[8][9][10][11][12] Recently, a prediction tool was developed and internally validated to aid the diagnosis of spontaneous Cushing's syndrome in dogs.13,14 This model demonstrated a good predictive performance in dogs attending UK primary-care practices, using neuter status, age, breed, polydipsia, vomiting, potbelly/hepatomegaly, alopecia, pruritus, urine specific gravity (USG), and serum ALP as predictor variables.
A prediction model can be validated internally and externally. With internal validation, the model is tested in patients who belong to the original population and indicates how the model would likely perform in a very similar population. With external validation, the model is tested in patients belonging to another population. External validation is an important part of model validation, as different population characteristics could have a major influence on the performance of a prediction model.15,16 These variations in population characteristics have been categorized as temporal, geographic, and domain differences.17,18 With temporal validation, the same study is performed in a similar population from a later time period. With geographic validation, the model performance is tested by varying the location of the population (eg, primary-care practices in the UK vs primary-care practices in the Netherlands). Domain validation can be performed when one suspects differences in patient groups (eg, primary-care vs secondary-care practice).[21] The aim of this study was to assess the aforementioned prediction model's transportability, applicability, and diagnostic performance in a secondary-care practice in the Netherlands.

All cases and non-cases were subsequently independently reviewed by the primary and last author and verified to be compliant (cases) or not (non-cases) with ALIVE criteria for diagnosis of Cushing's syndrome.24,25 Those not fulfilling ALIVE criteria were excluded. Current ALIVE criteria for PDH and ADH are shown in Table 1.

| Prediction tool
The published Cushing's Prediction Tool (Table 2) was used as described in the aforementioned study.14 Information regarding clinical signs and laboratory results within 1 week before and 1 week after the point of first assessment at the internal medicine service was used for scoring purposes. If a specific clinical sign was not mentioned in the medical record, it was considered to be absent.

| Additional laboratory factors and comorbidities
In addition to the baseline characteristics used as predictors in the prediction tool, the presence or absence of lymphopenia, hypercholesterolemia, persistent proteinuria using urine protein-to-creatinine ratio, urinary tract infection using culture, uroliths using imaging, vacuolar hepatopathy using cytology or histopathology, diabetes mellitus, systemic hypertension, thromboembolic disease (considered absent if not reported), and gallbladder sludge or mucocele using ultrasound were noted for every case.

| Statistical analysis
Statistical analysis was performed using R (R version 4.2.1, and RStudio v2022.02.3+492; packages mosaic, car, Hmisc, Epi, PredictABEL, and fmsb). Baseline characteristics were expressed as n and % for categorical variables. Subgroup differences in baseline characteristics were examined using Wilcoxon's rank sum test for non-normally distributed numeric variables and Fisher's exact test for categorical variables.
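As an illustration of the categorical comparison described above, the following is a minimal Python sketch of a two-sided Fisher's exact test for a 2 × 2 table. It is not the authors' R code; the function name `fisher_exact_two_sided` and the counts are hypothetical, and the two-sided P value is taken as the sum of the probabilities of all tables that are no more likely than the observed one, which is one common definition.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# Hypothetical counts (not study data): a clinical sign present/absent
# in cases (first row) and non-cases (second row).
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"two-sided P = {p_value:.4f}")
```

With these toy counts the margins allow five possible tables, and the P value is the combined probability of the four that are at least as extreme as the observed one.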
The first step in the external validation of the diagnostic prediction tool was to subjectively qualify the level of relatedness between the case mix of the development and validation sample (model transportability). This is an important step, as it helps to differentiate between reproducibility and transportability. Reproducibility refers to a model's capacity to produce accurate predictions in a new sample that is very similar to the development population, whereas transportability refers to the ability to produce accurate predictions in a different population. To assess model transportability, one would ideally see a low to moderate degree of relatedness. If the development and validation sample appear to be (almost) identical, the external validation study could actually reflect the model's reproducibility.26,27 Using

TABLE 1 Current ALIVE criteria for diagnosis of pituitary-dependent hypercortisolism and adrenal-dependent hypercortisolism.

Pituitary-dependent hypercortisolism criteria
Accepted ways to fulfill criteria
Identification of a set of clinical features attributable to Cushing's syndrome including.

Supportive history, physical examination findings and clinicopathologic test results
Demonstration of an excess of cortisol through dynamic testing of pituitary-adrenal function

This reflects whether individual dogs with Cushing's syndrome receive a higher predicted probability than those without. For c-statistic, 0.5-0.7 was interpreted as poor performance, 0.7-0.8 as acceptable performance, 0.8-0.9 as excellent performance, and >0.9 as outstanding performance.28 Significance was set at P < .05 for all analyses.
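The c-statistic and its interpretation bands can be sketched as follows. This is a minimal Python illustration, not the analysis code used in the study: the function names and the toy data are hypothetical, and because the published bands share their boundary values, assigning 0.7 and 0.8 to the upper band is an assumption made here.

```python
def c_statistic(y_true, y_prob):
    """Share of case/non-case pairs in which the case has the higher
    predicted probability (ties count as half)."""
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_cases = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for pc in cases:
        for pn in non_cases:
            if pc > pn:
                concordant += 1.0
            elif pc == pn:
                concordant += 0.5
    return concordant / (len(cases) * len(non_cases))


def interpret_c(c):
    """Map a c-statistic to the bands quoted in the text.
    Boundary handling (0.7 and 0.8 go to the upper band) is an assumption."""
    if c > 0.9:
        return "outstanding"
    if c >= 0.8:
        return "excellent"
    if c >= 0.7:
        return "acceptable"
    if c >= 0.5:
        return "poor"
    return "worse than chance"


# Hypothetical toy data (not study data): 1 = case, 0 = non-case.
outcomes = [1, 1, 0, 0]
predictions = [0.9, 0.4, 0.5, 0.1]
c = c_statistic(outcomes, predictions)
print(c, interpret_c(c))
print(interpret_c(0.83))  # the value reported for the validation sample
```

The pairwise definition is used because it matches the verbal description directly; it is equivalent to the area under the receiver operating characteristic curve.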

| Descriptive statistics
In total, 242 medical records were identified. Twelve dogs were excluded because they could not be classified as case or non-case using the ALIVE criteria.

| Model transportability
Several significant differences were found in the baseline characteristics of the development data set and the external validation data set (Table 5). Refitting the originally reported model in the current validation data set resulted in the regression coefficients and SEs presented in Table 6.
The discrimination of the model in the current dataset was excellent (c-statistic = 0.83; Figure 2). When setting the threshold of the prediction tool at 2 for the total prediction score (ie, dogs with prediction tool end score ≥2 are predicted as cases and <2 are predicted as non-cases), sensitivity was 91% and specificity 59%. This resulted in a negative predictive value (NPV) of 92% and positive predictive value

No final diagnosis 25 (16.7) Adrenal mass (4), suspected alopecia X (1), suspected chronic enteropathy (6), suspected chronic lymphatic leukemia (1), suspected Cushing's, lost on follow-up (1), suspected intracranial disease (2), suspected dermatological disorder (1), psychogenic polydipsia vs primary nephrogenic diabetes insipidus (9)

TABLE 4 Baseline characteristics and Fisher's exact association stratified for cases and non-cases.
*P value <.05 implies a significant difference in baseline characteristics of cases and non-cases.

There are several explanations for the moderate transportability of the model (ie, different case mix). Geographical differences could have resulted in the finding that breeds used as predictors in the prediction model were uncommon in our study cohort. One explanation for this could be a difference in breed-associated risk between the 2 countries, as breeding practices might lead to significant genetic differences and disease predisposition.29,30 However, it might also be explained by different breed popularities in the United Kingdom and the Netherlands. That the current study tested the model's performance in dogs that were presented to a referral hospital instead of a primary-care practice could explain why the dogs diagnosed with Cushing's syndrome in the present study had a higher frequency of clinical signs associated with the condition. This could indicate that dogs diagnosed with Cushing's syndrome in a secondary-care practice show a more prominent clinical picture compared to those diagnosed in primary-care practice. Another explanation could be that the attending internists were more likely to recognize or report these clinical signs. Additionally, the dogs with and without Cushing's syndrome were more similar to each other in the current study than those in the UK primary-care caseload. Dogs without an overly clear clinical picture of Cushing's syndrome but still with some of the suggestive clinical signs and laboratory variables might be referred more often. In addition, veterinarians specialized in internal medicine within secondary-care practice might be more familiar with and/or confident in assessing the clinical picture of Cushing's syndrome. Indeed, polydipsia, potbelly/hepatomegaly, and alopecia were noted more often in non-cases of the external validation group than non-cases of the development group. Such differences could explain why the model overestimates the probability of having Cushing's syndrome in the current cohort's non-cases.
In the current study, only a few of the original model's predictors were found to be significant in the external validation sample (male-entire, vomiting, alopecia, not diluted or not recorded USG, and not

However, it is possible that they were present but not recorded, leading to incorrect scores. The authors chose to use the internationally agreed ALIVE criteria for diagnosis of PDH and ADH. The veterinary endocrinology communities actively promote the use of ALIVE definitions since they foster uniformity and comparability of data.36 The results of this study should therefore be seen in the light of these specific disease definitions. Results could have been different had other definitions been used, such as in the original British study. The same applies to the use of UCCR in combination with o-HDDST for the diagnosis and differentiation of Cushing's syndrome. This test is popular in the Netherlands, yet not commonly used internationally.37 A drawback of the prediction model itself is that it uses predictors that are also part of the reference standard for diagnosis (ie, the opinion of the attending veterinarian, based on a combination of medical history, clinical signs, physical examination, routine laboratory investigations, endocrine tests, and diagnostic imaging). This is known as incorporation bias, and it can lead to overestimation of the diagnostic accuracy.38

ACKNOWLEDGMENT
No funding was received for this study.

CONFLICT OF INTEREST DECLARATION
Authors declare no conflict of interest.

OFF-LABEL ANTIMICROBIAL DECLARATION
Authors declare no off-label use of antimicrobials.

INSTITUTIONAL ANIMAL CARE AND USE COMMITTEE (IACUC) OR OTHER APPROVAL DECLARATION
Authors declare no IACUC or other approval was needed.

HUMAN ETHICS APPROVAL DECLARATION
Authors declare human ethics approval was not needed for this study.

9063-0238
Céline Anne Bik https://orcid.org/0009-0006-3039-0339
Imogen Schofield https://orcid.org/0000-0003-3169-8723

All electronic medical records of the Internal Medicine service of small animal referral clinic "Amsterdam Medisch Centrum voor Dieren" containing results of UCCRs in combination with an oral high dose dexamethasone suppression test (most commonly used in the Netherlands and at the investigational hospital) between January 2011 and December 2020 were reviewed. The use of UCCR and oral high dose dexamethasone suppression test (o-HDDST) has previously been validated 9,22,23 and involved the collection of a morning urine sample by the owner on 3 consecutive days. After collection of the second urine sample, the owner administered 3 oral doses of dexamethasone (0.1 mg/kg/dose) at 8-hour intervals. UCCR was measured in all 3 morning urine samples. In an animal with appropriate signalment and suggestive clinical signs, hypercortisolism was suspected if the average of the first 2 UCCRs was ≥10 × 10⁻⁶; this cut-off was previously established by the University of Utrecht Veterinary Laboratory.9 If the third UCCR was <50% of the mean of the first 2 samples, this was considered suggestive of pituitary-dependent hypercortisolism (PDH). According to the described validation of the methodology, a decrease <50% was considered to be suggestive of pituitary-dependent (but dexamethasone-resistant) hypercortisolism, ectopic ACTH excess, or ACTH independence (eg, adrenal-dependent hypercortisolism [ADH]). The same laboratory and assay were used for UCCR measurement as in the validation publications.

Dogs were included as cases (ie, having Cushing's syndrome) if this was the final conclusion of the attending internist based on a combination of medical history, clinical signs, physical examination, routine laboratory investigations, endocrine tests (UCCR and o-HDDST), and diagnostic imaging (abdominal ultrasound ± computed tomography). Cases were excluded if a subsequent revision of the diagnosis was made in the medical record. Dogs were included as non-cases (ie, not having Cushing's syndrome) if the attending internist considered Cushing's syndrome and subsequently ruled out this diagnosis based on normal UCCRs and o-HDDST in combination with 1 or more of the following: medical history, clinical signs, physical examination, routine laboratory investigations, other endocrine tests, and diagnostic imaging. A definite alternative diagnosis was not required for non-cases.
the summary measures (percentages, median, and range), the distribution of the dog characteristics was compared. This included the predictors in the validated model and outcome occurrence. Furthermore, the extent to which the development and validation samples share common predictor effects was evaluated by refitting the original logistic regression model in the validation sample. The estimated regression coefficients and corresponding SEs were compared to evaluate the heterogeneity in the predictor-outcome associations.

The second step of external validation involved examining the calibration and discrimination to assess the model's performance in the new validation sample. Calibration measures the agreement between observed and predicted outcomes. Calibration-in-the-large was given as the intercept term a from the recalibration model logit(y) = a + b × logit(ŷ), in which logit(y) is the natural logarithm of the observed odds of being diagnosed with Cushing's syndrome and logit(ŷ) the natural logarithm of the predicted odds of being diagnosed with Cushing's syndrome. Ideally, the intercept term a should equal 0. If a < 0, this indicates the model overestimates the odds, whereas a > 0 indicates underestimation. The calibration slope was estimated as b from the same recalibration model. Ideally, the calibration slope b equals 1. If 0 < b < 1, this often indicates predictions vary too much (ie, too low with low predicted probability, too high with high predicted probability), whereas b > 1 implies the opposite. Discrimination was evaluated by calculating the c-statistic with 95% confidence intervals (CI), corresponding to the area under the receiver operating characteristic curve (AUROC) for the outcome diagnosis Cushing's syndrome.
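The recalibration model described above, logit(y) = a + b × logit(ŷ), can be fitted as an ordinary logistic regression with logit(ŷ) as the only covariate. The following is a minimal pure-Python sketch using Newton-Raphson; it is not the R code used in the study, the function names are hypothetical, and the data are invented solely to show a model that overestimates risk.

```python
from math import exp, log


def logit(p):
    return log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def fit_recalibration(y, p_hat, n_iter=50, tol=1e-10):
    """Fit logit(P(y = 1)) = a + b * logit(p_hat) by Newton-Raphson.
    Returns (a, b): calibration-in-the-large and calibration slope."""
    x = [logit(p) for p in p_hat]
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for yi, xi in zip(y, x):
            mu = sigmoid(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu          # score for the intercept
            g1 += (yi - mu) * xi   # score for the slope
            h00 += w               # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) < tol and abs(db) < tol:
            break
    return a, b


# Hypothetical data (not study data): three groups of 10 dogs with predicted
# probabilities 0.2, 0.5, and 0.8 and observed case counts 1, 3, and 6.
p_hat = [0.2] * 10 + [0.5] * 10 + [0.8] * 10
y = [1] * 1 + [0] * 9 + [1] * 3 + [0] * 7 + [1] * 6 + [0] * 4
a, b = fit_recalibration(y, p_hat)
print(f"a = {a:.2f}, b = {b:.2f}")
```

In this invented example fewer dogs are cases than the predictions suggest, so the fitted intercept a is negative, mirroring the direction of miscalibration reported for the Dutch sample.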
The regression coefficients and standard errors of the original model are shown as a comparison. From the predictors of the original model, the regression coefficients of male-entire, vomiting, alopecia, USG not dilute, USG not recorded, and serum ALP not elevated demonstrated a P < .05. The P value of the regression coefficient of polydipsia and potbelly proved <.10. All other regression coefficients demonstrated a P > .10.

(PPV) of 54% for diagnosis of Cushing's syndrome in this group of referred Dutch dogs. Decreasing the threshold for the total prediction

TABLE 3 Final diagnosis recorded in the medical records for non-cases (n = 150).
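The predictive values reported for the threshold of 2 follow from the sensitivity, the specificity, and the proportion of cases in the sample (80 of 230 dogs). A minimal Python sketch of that arithmetic is given here as a consistency check; the function name is hypothetical, and because the inputs are the rounded percentages from the text, the results only approximate the published figures.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence               # true positives (as proportions)
    fn = (1 - sensitivity) * prevalence         # false negatives
    tn = specificity * (1 - prevalence)         # true negatives
    fp = (1 - specificity) * (1 - prevalence)   # false positives
    return tp / (tp + fp), tn / (tn + fn)


# Sensitivity 91%, specificity 59%, and 80 cases among 230 dogs (from the text).
ppv, npv = predictive_values(0.91, 0.59, 80 / 230)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

Because both predictive values depend on prevalence, the same tool would yield a lower PPV and a higher NPV in a population where Cushing's syndrome is less common among the dogs tested.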

4 | DISCUSSION
This study showed several significant differences in dog characteristics between the UK primary-care development sample and the Dutch secondary-care external validation sample. Moreover, comparison of the estimated regression coefficients and corresponding standard errors revealed substantial heterogeneity in the predictor-outcome associations. These findings imply moderate transportability of the model, which means that the Dutch dogs differ from those in the original UK study, supporting that external validation was performed in the current study.
Calibration of the model detected a discrepancy between the predicted and observed odds of being diagnosed with Cushing's syndrome, with the model overestimating the probability of having Cushing's syndrome in the Dutch dogs. Some underfitting of the model was detected, implying less variation in the predictive chance of being diagnosed with Cushing's syndrome (high predictions were too low, whereas low predictions were too high). On the other hand, discriminating ability of the model (ie, can the model distinguish cases from non-cases) was found to be excellent given an AUROC of 0.83. This was fairly similar to the AUROC of the prediction tool in its development study sample. Overall, this implies that the model showed good performance in this group of dogs used for external validation.
Several possibilities exist to improve the model's performance for use within secondary-care practice in the Netherlands, without creating a new model completely. As a first possible step, the intercept could be updated for a better calibration.26,31 Second, to improve the model's discriminating ability, removing predictors from the original model could be considered, since several of the original predictors were not associated with outcome in our study (eg, breeds). Adding new predictor variables to the original model could be a third step. In the Dutch dogs, lymphopenia, hypercholesterolemia, and abnormal gall bladder content (sludge, mucocele) proved more common in cases than non-cases. These variables were selected because of the evidence in the literature to indicate their discriminatory ability between dogs with and without Cushing's syndrome.[32][33][34] These adjustments were not part of the scope of this study. A new prediction model was not constructed using the Dutch data set, as the number of cases and non-cases was relatively low. This could lead to overfitting, rendering the improved or new model not useful for future predictions.35

The retrospective nature of the current study represents an obvious limitation. Laboratory findings were, for instance, not always available. Although the prediction model was developed to also be used when laboratory findings were not available, it might have shown a better performance with more data available. Moreover, clinical signs not mentioned in the medical record were considered to be absent.

FIGURE 1 Calibration plot for the outcome diagnosis of Cushing's syndrome. The plot shows the mean observed proportions of dogs with a diagnosis of Cushing's compared to the mean predicted probabilities, by deciles of predictions. The 45° line denotes perfect calibration.

FIGURE 2 Receiver operating characteristic curve for the outcome diagnosis of Cushing's syndrome.
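The intercept update mentioned in the Discussion as a first possible step amounts to shifting every prediction by a constant on the log-odds scale. The Python sketch below illustrates the idea under stated assumptions: the function name is hypothetical, and the shift of −1.10 is borrowed from the reported calibration-in-the-large purely for illustration, whereas an actual intercept-only update would normally re-estimate the intercept with the slope fixed at 1.

```python
from math import exp, log


def update_intercept(p_original, a):
    """Shift a predicted probability by a on the log-odds scale."""
    z = log(p_original / (1 - p_original)) + a
    return 1 / (1 + exp(-z))


# Illustrative only: an original prediction of 50% shifted by a = -1.10.
p_updated = update_intercept(0.50, -1.10)
print(f"updated probability = {p_updated:.2f}")
```

A negative shift lowers every predicted probability while leaving the ranking of dogs, and therefore the c-statistic, unchanged.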
Finally, an important limitation to note is the sample size. Ideally, a large-scale dataset is used for validation purposes to prevent imprecise predictive performance estimates. To minimize this effect, statistical methods were used to account for potential overestimation of model performance because of model fitting on the same dataset. This correction ensures a more accurate estimation of the model's predictive ability and helps mitigate any potential bias introduced by the relatively low numbers of cases.

In conclusion, this external validation study showed moderate transportability of a Cushing's syndrome prediction model developed in the UK in a group of dogs presented to a secondary-care practice in the Netherlands. The model had excellent discriminatory ability. Overall, the tool did overestimate the probability of having Cushing's syndrome. Despite its limitations, the tool could still prove useful in its current form. Using a total prediction score cut-off of 0 (ie, score < 0 predicts the dog does not have Cushing's syndrome), it demonstrated an NPV of 99% and as such the model could be useful as a screening test early in the diagnostic pathway to rule out Cushing's syndrome as a likely explanation for the dog's clinical signs. Adrenal function tests and differentiating tests could then be pursued if Cushing's syndrome remains a probable diagnosis based on the tool's result, the dog's overall clinical picture, and routine laboratory results. The study emphasizes that ongoing validation of this and other disease prediction tools is worthwhile when considering their use in populations with differing characteristics, such as in different countries or practice types.
Prediction tool to calculate the likelihood of a dog having Cushing's syndrome.14
Note: To calculate the predicted likelihood of an individual dog having Cushing's syndrome, one has to add together the points that correspond to the category for each predictor and match the final score to the predicted likelihood as published by Schofield et al.14 This way, a likelihood between 0% (score −13) and 96% (score 10) can be predicted.