Warfarin dosing algorithms: A systematic review

Aims: Numerous algorithms have been developed to guide warfarin dosing and improve clinical outcomes. We reviewed the algorithms available for various populations and the covariates, performances and risk of bias of these algorithms.

Methods: We systematically searched MEDLINE up to 20 May 2020 and selected studies describing the development, external validation or clinical utility of a multivariable warfarin dosing algorithm. Two investigators conducted data extraction and quality assessment.

Results: Of 10 035 screened records, 266 articles were included in the review, describing the development of 433 dosing algorithms, 481 external validations and 52 clinical utility assessments. Most developed algorithms were for dose initiation (86%), were developed by multiple linear regression (65%) and were mostly applicable to Asians (49%) or Whites (43%). The most common demographic/clinical/environmental covariates were age (included in 401 algorithms), concomitant medications (270 algorithms) and weight (229 algorithms), while CYP2C9 (329 algorithms), VKORC1 (319 algorithms) and CYP4F2 (92 algorithms) variants were the most common genetic covariates. Only 26% and 7% of algorithms were externally validated and evaluated for clinical utility, respectively, and <2% of algorithm developments and external validations were rated as having a low risk of bias.

Conclusion: Most warfarin dosing algorithms have been developed in Asians and Whites and may not be applicable to under-served populations. Few algorithms have been externally validated, assessed for clinical utility, and/or have a low risk of bias, which makes them unreliable for clinical use. Algorithm development and assessment should follow current methodological recommendations to improve reliability and applicability, and under-represented populations should be prioritized.


| INTRODUCTION
Warfarin remains the most commonly prescribed oral anticoagulant for the management of thromboembolic disorders.1 However, dosing remains challenging due to warfarin's narrow therapeutic index and highly variable clinical response. These dosing challenges usually result in a high frequency of adverse effects (thrombosis and bleeding) as well as an increased burden to the patient (e.g. more frequent monitoring), which could impact quality of life and lead to treatment discontinuation of an otherwise highly efficacious drug.2

| Search strategy and selection criteria
A predefined protocol (PROSPERO: CRD42019147995) was followed, based on the principles set out in the CHARMS (CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies) checklist11 and PROBAST (Prediction model Risk Of Bias ASsessment Tool), a tool for assessing the risk of bias and applicability of prediction model studies.12 This report adheres to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement (Table S1). MEDLINE records (from 1946 to 22 August 2019) were searched using medical subject headings (MeSH terms) and text words related to "warfarin", "algorithm" and "dosing" (Table S2). Additionally, for a clinical utility assessment study to be included, a comparison between a dosing algorithm and an alternative strategy (such as existing clinical practice) was a prerequisite. For the purposes of this review, clinical utility13 was defined as the demonstration that a dosing algorithm improved the quality of anticoagulation (based on the time spent in the therapeutic INR range) or led to better clinical endpoints (such as fewer bleeding episodes). Not to be confused with the outcome to be predicted in the individual studies (i.e. the stable warfarin dose), the primary outcome of interest in this review was the warfarin dose-prediction algorithm developed, and whether it was externally validated or evaluated for clinical utility in the included studies.

| Data extraction and quality assessment
One reviewer (I.G.A.) screened titles and abstracts of the retrieved bibliographic records for eligibility. For all stages, a second reviewer (R.O.) independently checked a random 10% of the records for consistency. Disagreements were resolved by consensus and, because the first reviewer applied the agreed criteria consistently, only the first reviewer continued reviewing the remaining records. A data extraction form was adapted from the CHARMS11 and PROBAST12 tools, piloted on a subset of randomly selected included papers and used to extract relevant information related to participants, predictors, outcome, analysis and results. When a single publication reported both development and external validation studies (and/or clinical utility assessments), or multiple algorithms, each study/algorithm was assessed separately.12 The exception was studies that reported the warfarindosing.com platform: although this platform incorporates multiple algorithms, it was not possible to separate the individual algorithms, so it was considered as 2 algorithms, the clinical and the pharmacogenetic Gage algorithms.14 Algorithm updating/extension studies in which new predictors were added to existing algorithms were considered as new algorithm development studies.12 To assess the methodological quality of each included development or external validation study, the 2 reviewers used the PROBAST tool.12 Although this tool focuses on prediction models that consider binary or time-to-event outcomes and studies that use generalized linear modelling, its authors encourage its use with other outcomes and other machine learning techniques, such as those explored in this review.12 It should, however, be tailored to these other outcomes/techniques, as we did in Tables S3 and S4 and Figure S1. For reasons detailed in Table S3, emphasis was placed on the assessment of the risk of bias in the analysis domain.
We did not assess the methodological quality (and performance) of clinical utility (impact) assessment studies since these have been previously explored in several systematic reviews and meta-analyses.15-22

| Data synthesis
This systematic review was qualitative in nature and no attempt was made to quantitatively synthesise studies by way of meta-analysis. Where race categories were unreported, country was used as a proxy (for example, populations from China were categorized as Asian while populations from northern Europe were categorized as White). Regarding which algorithm would be relevant to a given population, we arbitrarily chose a 5% cut-off, i.e. an algorithm that recruited at least 5% of a given population was considered applicable to that population. These descriptive analyses were conducted in R (version 3.6.1).24 No sensitivity analyses were conducted.

[Figure 1. PRISMA flow chart of included studies. aIncludes studies that neither stated in their aims that they were developing/validating a dosing algorithm nor reported dosing equations, nomograms, charts, tables or other tools that can be used to provide a daily or weekly dose. bPrior doses and international normalized ratios not counted.]
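The 5% applicability cut-off described above can be sketched as a small rule. This is an illustrative reconstruction, not code from the review; the race labels and cohort proportions below are hypothetical.

```python
def applicable_populations(race_proportions, cutoff=0.05):
    """Races to which an algorithm is considered applicable under the
    review's arbitrary cut-off: at least 5% of the development cohort
    must come from that population."""
    return [race for race, prop in race_proportions.items() if prop >= cutoff]

# Hypothetical development cohort composition
cohort = {"Asian": 0.60, "White": 0.36, "Black": 0.04}
print(applicable_populations(cohort))  # ['Asian', 'White']
```

Under this rule, an algorithm developed in the hypothetical cohort above would be deemed applicable to Asian and White but not Black patients, despite Black patients being present in the data.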

| Nomenclature of targets and ligands
Key protein targets and ligands in this article are hyperlinked to corresponding entries in http://www.guidetopharmacology.org, the common portal for data from the IUPHAR/BPS Guide to PHARMACOLOGY.

| RESULTS
We aimed to summarize which algorithms are available for which populations, and the covariates, performances and risk of bias of these algorithms (Figure 1; Table 1; Figure 2). Most external validations were conducted on pharmacogenetic (n = 432 external validations, 90%) and dose initiation (n = 443, 92%) algorithms, on algorithms developed using multiple linear regression (n = 458, 95%) and on those that presented a regression formula (n = 453, 94%). A similar trend was observed for the clinical utility assessments (Table 1).

| Predictive performance
A consideration of the race-specific proportions in each stratified analysis (Tables S10-S13) should be made when interpreting the race-stratified performances. For example, for the 24 studies that included at least 5% Black patients, the proportion of warfarin dose variability that can be attributed to VKORC1 is 23%. However, these 24 studies included a median of only 13% (range 5-100%) Black patients. When only the 3 studies that included only Black patients are considered, the median VKORC1 partial R2 becomes 9% (range 7-10%). These partial R2 values should also be interpreted cautiously since different computation approaches yield different results (Figure S1).
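To illustrate how the computation approach changes a partial R2, the sketch below contrasts 2 common conventions on simulated data (the predictors, effect sizes and cohort are invented, not taken from any included study): the increase in R2 when the genetic predictor is added to a clinical model (the squared semi-partial correlation) versus the partial R2, which rescales that increase by the variance the clinical model leaves unexplained.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
# Simulated stand-ins: VKORC1 variant allele count (0-2) and age
vkorc1 = rng.integers(0, 3, n).astype(float)
age = rng.normal(60, 12, n)
dose = 6.0 - 1.2 * vkorc1 - 0.04 * age + rng.normal(0, 1.0, n)  # daily mg

def r_squared(X, y):
    """R2 of an ordinary least-squares fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_full = r_squared(np.column_stack([age, vkorc1]), dose)
r2_without_vkorc1 = r_squared(age[:, None], dose)

semi_partial_r2 = r2_full - r2_without_vkorc1  # "added" R2 when VKORC1 enters
partial_r2 = (r2_full - r2_without_vkorc1) / (1.0 - r2_without_vkorc1)
# The 2 conventions give different numbers for the same data, so reported
# VKORC1 contributions computed differently are not directly comparable.
```

Because the denominator of the partial R2 is smaller than 1, it is always at least as large as the semi-partial value on the same data, which is one reason cross-study comparisons of "VKORC1's contribution" require knowing which convention was used.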
Regarding the precision (predictive accuracy) and bias measures, the most reported were the mean absolute error and the mean prediction error, respectively, reported 137 (32%) and 17 (4%) times in algorithm developments and 222 (46%) and 144 (31%) times in external validations. The median mean absolute errors for the algorithm developments and external validations are reported in Tables 2 and S17.

[Figure 2. Algorithm development/evaluation by publication year]

[Figure 3. Predictors included in at least 10 algorithms. APOE, apolipoprotein E; CYP2C9, cytochrome P450, family 2, subfamily C, polypeptide 9; CYP4F2, cytochrome P450, family 4, subfamily F, polypeptide 2; PK parameters, pharmacokinetic parameters (S-warfarin clearance and/or distribution volume); INR, international normalized ratio; VKORC1, vitamin K epoxide reductase complex subunit 1]
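For readers unfamiliar with the 2 measures, the toy example below (all dose values invented for illustration) shows why the mean prediction error is a signed bias measure that can be negative, whereas the mean absolute error cannot.

```python
import numpy as np

predicted = np.array([4.0, 5.5, 3.0, 6.0])  # predicted weekly doses, mg (illustrative)
observed = np.array([5.0, 5.0, 4.5, 6.5])   # observed stable doses, mg (illustrative)

errors = predicted - observed
mean_prediction_error = errors.mean()        # signed bias; negative here (under-prediction)
mean_absolute_error = np.abs(errors).mean()  # precision; by construction never negative
```

A reported negative "mean absolute error" is therefore a giveaway that a signed mean prediction error was probably computed instead.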
Because most studies reported R2 (a fit accuracy measure), we carried out a post hoc correlation analysis including the studies that reported both R2 and a precision accuracy measure, to determine whether R2 could be a good proxy of predictive accuracy. For this purpose, we used the mean absolute error as the predictive accuracy measure because it was the most reported, notwithstanding its limitations as a predictive accuracy measure (Tables S4, S5 and S7). To directly compare the performances of algorithms stratified according to the modelling technique and time of application (dose initiation or dose revision), we summarized the studies that, using the same dataset, included at least 2 algorithms that differed in these 2 characteristics. As expected, dose revision algorithms generally performed better than dose initiation algorithms (Table S18). Multiple linear regression performed comparably to, or even better than, many other machine learning techniques (Table S19). Although pharmacokinetic/pharmacodynamic algorithms (fitted using nonlinear mixed-effect modelling) performed better than other algorithms, this is mainly attributable to their dose revision aspects (i.e. when used for dose initiation, performance was comparable). However, the numbers of direct comparisons were few, and the performance metrics used were probably suboptimal (Table S4).
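The distinction between fit accuracy and predictive accuracy can be demonstrated on simulated data (everything below is illustrative; neither the cohorts nor the models come from the included studies): a more flexible model always attains at least as high an in-sample R2 on the development cohort, but that need not translate into a lower mean absolute error in an external cohort.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate(n):
    """Hypothetical cohort in which weekly dose depends linearly on age."""
    age = rng.uniform(20, 80, n)
    dose = 30.0 + 0.1 * age + rng.normal(0, 2.0, n)
    return (age - 50.0) / 30.0, dose  # rescale age for numerical stability

x_dev, y_dev = simulate(15)    # small development cohort
x_val, y_val = simulate(200)   # external validation cohort

def fit_and_score(degree):
    coef = np.polyfit(x_dev, y_dev, degree)
    pred_dev = np.polyval(coef, x_dev)
    ss_res = ((y_dev - pred_dev) ** 2).sum()
    ss_tot = ((y_dev - y_dev.mean()) ** 2).sum()
    r2_dev = 1.0 - ss_res / ss_tot                            # fit accuracy
    mae_val = np.abs(np.polyval(coef, x_val) - y_val).mean()  # predictive accuracy
    return r2_dev, mae_val

r2_simple, mae_simple = fit_and_score(1)       # linear model
r2_flexible, mae_flexible = fit_and_score(10)  # overfitted polynomial
# r2_flexible >= r2_simple by construction (nested least-squares models),
# yet the flexible model's validation MAE is typically worse, which is
# why in-sample R2 is a questionable proxy for predictive accuracy.
```

The in-sample R2 comparison is guaranteed by the nesting of the models; how much worse the flexible model predicts externally depends on the noise and sample size, which is exactly the point about optimism and overfitting made in the risk-of-bias assessment below.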

| Risk of bias
We focused on the assessment of the risk of bias in the analysis domain (Tables S3, S4, S6, S7 and S20). Most algorithm developments had ≥20 participants per candidate predictor variable (n = 203, 47%), did not provide information on the handling of continuous and categorical predictors (n = 291, 67%), probably included all enrolled participants in the analysis (n = 229, 53%) and did not provide information on the handling of participants with missing data (n = 233, 54%; Table S20). Additionally, many algorithm developments relied on univariable (n = 204, 47%) and multivariable (n = 208, 48%) analysis during predictor selection, did not appropriately evaluate algorithm performance (n = 232, 54%), did not account for model overfitting and optimism in algorithm performance (n = 300, 69%), and did not provide enough information to assess whether predictors and their assigned weights in the final algorithms corresponded to the results reported in the multivariable analysis (n = 220, 51%). Consequently, only 1 (<1%) algorithm was rated as having a low risk of bias (unclear: n = 26, 6%; high: n = 406, 94%). In some studies (e.g. Botton,30 You,31 Tan,32 Biss,33 Zhou,34 Lin,35 Xie36) the performance measures were unclear or inconsistent with their definitions (if available) and/or reported values, in which case a best guess was made; for example, a negative mean absolute error was likely to be a mean prediction error.
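The logic behind overall counts like "1 low, 26 unclear, 406 high" can be sketched as follows. This is a simplified paraphrase of the PROBAST aggregation rule (a single high-risk domain makes the overall rating high, and a low overall rating requires every assessed domain to be low), not the tool itself.

```python
def overall_risk_of_bias(domain_ratings):
    """Simplified PROBAST-style aggregation across assessed domains."""
    if any(rating == "high" for rating in domain_ratings):
        return "high"
    if all(rating == "low" for rating in domain_ratings):
        return "low"
    return "unclear"

print(overall_risk_of_bias(["low", "low", "low", "low"]))      # low
print(overall_risk_of_bias(["low", "unclear", "low", "low"]))  # unclear
print(overall_risk_of_bias(["low", "low", "high", "low"]))     # high
```

This asymmetry explains why low overall ratings are rare: a single deficient analysis step is enough to tip the whole assessment to high risk of bias.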
Although we did not focus on the risk of bias in the participant, predictors and outcome domains, the key risk of bias concerns in these domains are reported in Tables S6 (algorithm development) and S7 (external validation).

… and White (19% of 153) algorithms. Despite a higher inclusion of these other CYP2C9 variants in studies employing at least 5% Black patients, 36% may still be a low figure given the importance of these African-specific variants.37 When undertaking multivariable modelling, other population- and/or clinical setting-specific considerations, such as the availability and cost of predictors, should also always be considered.12 Our third objective was to evaluate the performances of these algorithms. As reported previously,40 the coefficient of determination (R2) was the most common performance measure (reported in 75% of algorithm developments and 54% of external validations). Based on R2, the median contribution of clinical factors (20%) and VKORC1 (25%) was similar to previous estimates,7,38 although CYP2C9's contribution (7%) was lower (previously estimated at 12%38 and 15%7).
The first of 2 key cautions is that, as for all the other performance measures, these summary estimates were descriptive in nature since we did not conduct a formal quantitative synthesis, which is possible with the preferred measures (Table S4) and methods (such as individual participant data meta-analysis41). Because of the descriptive nature of the study, different algorithms using the same or overlapping datasets were also of little concern. The second caution is that R2 is a fit accuracy and not a prediction accuracy measure, and the former is of less relevance when evaluating the value of prediction algorithms.42

Algorithm studies should adhere to methodological and reporting guidelines,49 which, if followed correctly, would result in a low risk of bias rating with the risk of bias assessment tool that we used. It is of concern that, although the TRIPOD guidelines were published in 2015, none of the other 163 studies reported from 2016 onwards refer to their use. In the context of not adhering to current methodological recommendations, warfarin dosing algorithms may not be unique.50,51 The consequences of most of the design flaws have been previously described in detail.52 One key issue that has received less attention is data transformation (done in 44% of the algorithm developments). As discussed by Keene,53 we discourage data-driven decisions on whether to transform and recommend that the logarithmic transformation be preferred because it produces a proportional/multiplicative scale that is clinically relevant and easy to interpret.53

With the key clinical utility trials,55 including those of Pirmohamed et al.56 and Gage et al.,57 providing conflicting results, we neither quantitatively synthesized (for both benefit and safety) nor assessed the risk of bias of the clinical utility studies, since we felt these have been previously explored in more detail in several systematic reviews and meta-analyses.
15-22 In addition to the CPIC guidance, before using these or other algorithms, clinicians, guideline developers and/or policymakers are reminded to ensure their applicability to their respective populations.65 Although we tried to identify other studies through reference list searching, using only the MEDLINE database also limited the number of studies that we could include in this review. Lastly, we relied on single-reviewer extraction; a second reviewer, nevertheless, confirmed consistency based on a random selection of 10% of the included papers.
For further research, novel/existing algorithms may need to be developed or updated and externally validated following recommended guidelines such as TRIPOD.49 More attention needs to be paid to under-represented populations such as minority ethnic groups and children (for whom only 3% of algorithms were developed) to reduce health disparities. Moreover, although newer direct-acting oral anticoagulants have been developed, warfarin is likely to remain the preferred choice for some of these groups.66 In conclusion, this systematic review provides a comprehensive summary of the available warfarin dosing algorithms, their covariates, performances and risk of bias.

All authors read and approved the final manuscript.

DATA AVAILABILITY STATEMENT
All relevant material is provided in the supplementary material.