The efficacy of the National Surgical Quality Improvement Program surgical risk calculator in head and neck surgery: A meta‐analysis

The National Surgical Quality Improvement Program surgical risk calculator (SRC) estimates the risk for postoperative complications. This meta‐analysis assesses the efficacy of the SRC in the field of head and neck surgery.

surgeon-patient decision-making and preoperative informed consent. 3 This is particularly important in head and neck surgery, in which long procedures, serious comorbidities, and complex reconstructions contribute to significant perioperative and postoperative risk. 4,5 One publicly available tool is the American College of Surgeons' National Surgical Quality Improvement Program (ACS NSQIP) surgical risk calculator (SRC). In 2013, the SRC was developed using data from 393 ACS NSQIP hospitals and 1 414 006 patients. 6,7 The SRC uses a Current Procedural Terminology (CPT) code and 21 preoperative factors to predict a patient's risk for each of 13 postoperative complications. Since the development of the SRC, there have been many attempts to validate its use in specific surgical fields, with mixed results. 8,9 While multiple retrospective cohort studies have shown only limited efficacy for the SRC in otolaryngology, 10,11 external validation of a prognostic model may require a sample size greater than 200 events, 12 which many of these studies individually fail to capture. This project is a meta-analysis that pools data from multiple cohort studies to better assess the efficacy of the SRC in head and neck oncologic surgery. The authors hypothesized that the SRC would not show adequate predictive value in this field given the unique pathophysiology and treatment risks for head and neck cancer.

| Systematic review and inclusion of literature
A systematic review of five online databases (PubMed, SCOPUS, Embase, Cochrane, and Google Scholar) was conducted using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) method. Search terms were ("NSQIP" OR "national surgery quality improvement project") AND ("risk calculator" OR "risk" OR "SRC"). Duplicate articles were excluded, and the remaining papers were screened by title and then by abstract by two independent authors (JH and DR), with a third author (KR) resolving any discrepancies. Studies comparing the SRC's predictions to observed outcomes following head and neck oncologic surgeries were included. Studies that included nonotolaryngologic procedures or that were written in a language other than English were excluded. Ethical approval was not required from the Institutional Review Board (IRB) at the senior author's institution.

| Data collection
Data were independently extracted by two authors (JH and VA). The following surgical outcomes were included: postoperative mortality, any complication, serious complication, unplanned reoperation, surgical site infection (SSI), pneumonia, cardiac complication, venous thromboembolism (VTE), urinary tract infection (UTI), and discharge to a nursing facility. The SRC defines serious complications as including cardiac arrest, myocardial infarction, pneumonia, progressive renal insufficiency, acute renal failure, pulmonary embolism, VTE, return to the operating room, deep incisional SSI, organ space SSI, sepsis, unplanned intubation, UTI, and wound disruption. Length of hospital stay (LOS) was also collected as a secondary outcome. For each complication, the total number of patients in the study, the number predicted preoperatively by the SRC to have the complication, and the number observed postoperatively to have the complication were recorded. If reported, receiver operating characteristic areas under the curve (ROC AUCs) and Brier scores were extracted for each complication.

| Statistical analysis
Pooled AUCs were calculated for each postoperative complication using a DerSimonian-Laird random effects estimator for τ². 13 Standard errors and AUCs were calculated for studies not reporting these values using the methodology described by Debray. 14 An ROC curve plots sensitivity against 1 − specificity, and the AUC reflects the probability that a patient who experienced an event had a greater risk estimate than a patient who did not (called discrimination). 15 An AUC >0.7 was predefined as the threshold for "acceptable" discrimination and an AUC >0.8 as the threshold for "excellent" discrimination, the standard for a high-quality predictive model. 10,16 Interstudy heterogeneity was assessed with calculation of the I² statistic using the Mantel-Haenszel model, with a score of I² < 50% defined as acceptably low.
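The pooling step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: it pools AUCs on their raw scale, derives I² from Cochran's Q rather than from a Mantel-Haenszel model, and uses made-up study values. The function name `dersimonian_laird` is hypothetical.

```python
import math


def dersimonian_laird(estimates, std_errors):
    """Pool study-level estimates with a DerSimonian-Laird random-effects model.

    Returns (pooled estimate, pooled standard error, tau^2, I^2 in percent).
    """
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two studies with matching standard errors")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w

    # Cochran's Q: weighted dispersion of the studies around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = k - 1

    # Method-of-moments estimate of the between-study variance, floored at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    pooled_se = math.sqrt(1.0 / sum(w_re))

    # I^2: share of total variability attributed to between-study heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, pooled_se, tau2, i2


# Hypothetical AUCs and standard errors for three studies (illustrative only,
# not taken from the included studies).
aucs = [0.62, 0.71, 0.55]
ses = [0.05, 0.04, 0.06]
pooled, se, tau2, i2 = dersimonian_laird(aucs, ses)
low, high = pooled - 1.96 * se, pooled + 1.96 * se
print(f"pooled AUC = {pooled:.3f} (95% CI {low:.3f} to {high:.3f}), I^2 = {i2:.1f}%")
```

Pooling on a transformed (for example, logit) scale is a common alternative for AUCs; the raw scale is used here only to keep the sketch short.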
Pooled Brier scores were calculated as a weighted sum of Brier scores from individual studies. A Brier score is the mean squared difference between patients' predicted probabilities and observed outcomes. For each complication, studies were included only if a Brier score was provided or could be calculated from available data. Weights were calculated as the individual study sample size divided by the total number of subjects in the pool for the specific complication. These scores reflect both discrimination and calibration, the latter defined as the ability to accurately predict outcomes across a range of risk. 17 A Brier score <0.01 describes an effective model. 18 The predicted and observed LOS between studies were compared by calculating the percent error for each study weighted by sample size. An unpaired t-test was conducted to determine if the SRC significantly underpredicted LOS.
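As a rough illustration of the two calculations above, the following Python sketch computes a Brier score from patient-level predictions and then pools study-level scores using sample-size weights. The function names and all numbers are hypothetical and chosen only to show the arithmetic; they are not data from the included studies.

```python
def brier_score(predicted, observed):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("predicted and observed must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


def pooled_brier(study_scores, study_sizes):
    """Weighted sum of study-level Brier scores.

    Each weight is the study's sample size divided by the total number of
    subjects pooled for that complication.
    """
    if not study_scores or len(study_scores) != len(study_sizes):
        raise ValueError("scores and sizes must be non-empty and equal length")
    total = sum(study_sizes)
    return sum(score * n / total for score, n in zip(study_scores, study_sizes))


# Hypothetical patient-level data: predicted risk and whether the event occurred.
predicted_risk = [0.10, 0.20, 0.05, 0.30]
event_occurred = [0, 1, 0, 0]
print(f"study Brier score: {brier_score(predicted_risk, event_occurred):.4f}")

# Hypothetical study-level Brier scores and sample sizes for one complication.
print(f"pooled Brier score: {pooled_brier([0.02, 0.05], [100, 300]):.4f}")
```

Because the weights sum to one, the pooled value always lies between the smallest and largest study-level score, and larger studies pull it toward their own result.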
A subgroup analysis was performed in which pooled Brier scores were calculated separately for patients with and without free-flap reconstructions using the methodology above. Two studies that included both patients with and without free flaps were excluded from this analysis. 18,19 All data analyses were conducted using Python 3.8 and the "meta" package from R (version 4.1.0, R Foundation for Statistical Computing, Vienna, Austria) in RStudio (version 4.1.1717, RStudio, Boston, MA). This report was deemed exempt by the IRB of the University of Pennsylvania Health System (UPenn-2023-00019.1).

| Literature search and included studies
The results of the literature search are presented in Figure 1. Of 3663 studies screened, nine met the inclusion criteria. Table 1 describes the study characteristics. The studies were published between 2016 and 2022 and included a total of 1774 patients. Four included only free-flap procedures (Arce, Ma, Tierney, Yung), 10,20-22 two included no free-flap procedures (Kao, Subramaniam), 23,24 and three included both free-flap and non-free-flap procedures (Prasad, Schneider, Vosler). 11,18,19 Five were conducted at American hospitals, and four were conducted internationally (Canada, India, and Australia).

| Length of stay
The SRC significantly underpredicted LOS. The average observed LOS was 207.9% of the predicted length in days (t = 3.031, p = 0.019).
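The kind of comparison behind this result can be sketched in Python as follows. The sketch assumes study-level mean LOS values, expresses observed LOS as a sample-size-weighted percentage of predicted LOS, and computes Welch's form of the unpaired t statistic; the source does not state which variant of the unpaired test was used. The function names and the numbers are hypothetical and are not the data from the included studies.

```python
import math
from statistics import mean, variance


def weighted_observed_to_predicted(observed, predicted, sizes):
    """Sample-size-weighted mean of observed LOS as a percentage of predicted LOS."""
    total = sum(sizes)
    return sum(100.0 * o / p * n / total for o, p, n in zip(observed, predicted, sizes))


def welch_t(a, b):
    """Welch's unpaired t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical study-level mean LOS in days (illustrative values only).
observed_los = [12.0, 9.5, 14.0, 8.0]
predicted_los = [6.0, 5.5, 6.5, 5.0]
study_sizes = [120, 80, 200, 60]

ratio = weighted_observed_to_predicted(observed_los, predicted_los, study_sizes)
t_stat, dof = welch_t(observed_los, predicted_los)
print(f"observed LOS is {ratio:.1f}% of predicted; t = {t_stat:.3f}, df = {dof:.1f}")
```

Turning the t statistic into a p value requires the t distribution, which the standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` is one way to obtain it.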

| Subgroup analysis
The results of the subgroup analysis are shown in Table 3 and presented in Figure 3. The Brier score for mortality in the non-free-flap cohort was 0.006. All other Brier scores for both subgroups were above the 0.01 threshold, suggesting poor predictive modeling.

| DISCUSSION
The NSQIP SRC quantifies a patient's risks of postoperative complications using a CPT code and unique patient factors and is validated for general use in surgery. 25 A tool that accurately predicts surgical risk would be particularly valuable in a field such as head and neck surgery, where up to 30% of patients can experience serious complications. 26 While data from over 1.4 million operations were used in the development of the SRC, only 2% of these patients underwent otolaryngologic procedures. 18 This is the first meta-analysis to evaluate whether the SRC retains efficacy when used for the specific population of head and neck patients.
This analysis found that the SRC underpredicts the risks of all postoperative complications except mortality. Only pneumonia and UTI reached the "acceptable" threshold for AUCs, and no complication reached the high-quality threshold for predictive accuracy. Similarly, calculated Brier scores demonstrated poor predictive value (scores ≥0.01) for all complications. Together these findings demonstrate that the SRC is unable to distinguish between high- and low-risk patients (poor discrimination) and is inconsistent across the range of possible risks (poor calibration), and therefore suggest the NSQIP SRC is not appropriate for use in head and neck surgery.
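The distinction between discrimination and calibration drawn above can be made concrete with a small Python sketch. It computes the AUC directly from its definition (the probability that a patient with an event received a higher predicted risk than a patient without one, counting ties as half) and compares the mean predicted risk with the observed event rate as a crude calibration check. The function names and data are hypothetical and purely illustrative.

```python
def concordance_auc(predicted, observed):
    """AUC as the share of event/non-event pairs ranked correctly (ties count half)."""
    events = [p for p, o in zip(predicted, observed) if o == 1]
    non_events = [p for p, o in zip(predicted, observed) if o == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                score += 1.0
            elif e == n:
                score += 0.5
    return score / (len(events) * len(non_events))


def mean_predicted_vs_observed(predicted, observed):
    """Return (mean predicted risk, observed event rate) as a crude calibration check."""
    return sum(predicted) / len(predicted), sum(observed) / len(observed)


# Hypothetical cohort: 1 = complication occurred, 0 = it did not.
observed = [1, 0, 1, 0, 0, 1, 0, 0]
baseline = [0.70, 0.20, 0.60, 0.30, 0.10, 0.80, 0.65, 0.20]
# Halving every predicted risk keeps the ranking of patients unchanged
# but systematically underpredicts the event rate.
halved = [p / 2 for p in baseline]

for label, risks in (("baseline", baseline), ("halved", halved)):
    auc = concordance_auc(risks, observed)
    mean_risk, event_rate = mean_predicted_vs_observed(risks, observed)
    print(f"{label}: AUC = {auc:.3f}, mean predicted = {mean_risk:.3f}, observed = {event_rate:.3f}")
```

In this toy cohort the halved predictions rank patients exactly as the baseline does, so the AUC is unchanged, while the mean predicted risk falls well below the observed event rate. A model can therefore underpredict risk without any change in its AUC, which is why the two properties are assessed separately.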
There are several factors that may contribute to these findings. The SRC was originally developed for colorectal cancer, 6 and although later studies have validated its use in surgery more broadly, 17 it retains features more relevant to general surgery. For instance, emergent presentation, sepsis, and acute renal failure are included in the model and are likely not applicable to patients undergoing elective oncologic resections. Conversely, prior radiation, free-flap use, and operative time are not considered, despite evidence correlating these factors with complications in head and neck surgery. 4,27 This may be why the model was better able to predict outcomes such as UTI and pneumonia, for which risk factors (including diabetes, steroid use, and chronic obstructive pulmonary disease [COPD]) are both included in the SRC and consistent across surgical disciplines. 28,29 Meanwhile, risks such as SSI might be underappreciated by the SRC due to the use of multiple surgical subsites in free-flap reconstructions. Tracheostomy placement or dependence is not included, which may contribute to the significant inaccuracy in LOS predictions. Finally, while head and neck procedures often aim to address multiple physiologic pathways including breathing, phonating, or deglutition in addition to oncologic resection, the SRC's model allows for only a single CPT code. 20
Prior studies have suggested that the SRC may be inaccurate specifically for free-flap reconstructions given the severity of malignant disease and the complexities of microvascular surgery. 11,20 Vosler et al. compared two cohorts of head and neck patients and found the SRC was accurate for those who did not undergo free-flap reconstruction, but underpredicted risk for those who did. 11 We investigated this hypothesis by calculating Brier scores separately for patients with and without free-flap reconstructions. Brier scores were consistently elevated without a clear trend toward either group. Although mortality for non-free-flap patients met the 0.01 threshold, there was only one observed event in this subanalysis. Many of the limitations to the SRC described above apply to head and neck patients who do not undergo free-flap reconstruction, and our findings do not suggest the SRC can be effectively applied to this subgroup.

| LIMITATIONS
This meta-analysis should be considered in the context of several important limitations. First, although we describe the largest cohort of otolaryngology patients in the literature, this still constitutes a relatively small population compared to the cohorts from which the SRC was developed. The SRC has been demonstrated to show greater accuracy in large populations, 15 and it is possible that the sample sizes used here do not fully capture the tool's efficacy.
Second, there was a significant degree of heterogeneity between the included studies, reflected by the high I² values presented in Table 2. However, seven of the nine studies concurred with our pooled result. In addition, the high degree of variation we identified between cohorts may itself be a reason to avoid relying on the SRC, as results may be inconsistent between institutions.
A third limitation is the lack of a widely validated alternative to the SRC. Despite its flaws, the SRC remains an easily accessible tool for quantifying risk, and some physicians may feel a model with a degree of error still provides some value. However, the SRC is not only imprecise but systematically underestimates risk, and therefore, even qualified use may misrepresent the serious risks associated with head and neck oncologic surgery.

| FUTURE DIRECTIONS
Recently proposed alternatives for head and neck surgical risk stratification warrant further investigation. Mascarella et al. used NSQIP data to construct a preoperative surgical risk index that compares favorably to the SRC for head and neck patients. 30 Frailty models have shown efficacy in head and neck surgical risk prediction. 31,32 Meanwhile, machine learning (ML) has emerged as an alternative to traditional risk calculation. In ML, a computer model uses data to continually reoptimize its predictive algorithms to allow for better identification of complex, nonlinear relationships. 33 Multiple ML models built specifically for head and neck surgical risk prediction have been proposed. 34,35 Goshtabi et al.'s ML model predicts the risk of LOS and discharge to a nursing facility following complex head and neck surgery with ROC AUCs >0.7 and is freely available online. 36 ML models have identified human papillomavirus (HPV) status, 37 postadjuvant radiotherapy, 37 tumor grade, 38 and flap ischemia time 39 as inputs that may not be included in traditional indices (such as the SRC) but improve accuracy for head and neck surgical risk modeling. An ML model from Howard et al. even predicted which head and neck patients would benefit from postoperative chemotherapy with a demonstrated survival benefit. 40 While many of these models await external validation, this literature suggests that ongoing research might result in otolaryngology-specific alternatives to the NSQIP SRC that better stratify risks for head and neck patients.

| CONCLUSION
Despite prior literature validating the ACS NSQIP SRC for use in broad surgical populations, this meta-analysis found that the SRC consistently underrepresents postoperative risks for head and neck patients, with poor discrimination and calibration across a range of outcomes. These inconsistencies are seen for surgeries with and without free-flap reconstructions. Our findings do not support the use of the NSQIP SRC in this field, and further research is needed to validate an alternate model.

FIGURE 1 Preferred Reporting Items for Systematic Review and Meta-Analysis flow diagram of the systematic review process. [Color figure can be viewed at wileyonlinelibrary.com]

FIGURE 2 The SRC's predicted and observed complication rates for each of the 10 included outcomes.

TABLE 3 Comparison of Brier scores between patients undergoing head and neck surgery with free-flap reconstruction ("free-flap subgroup") and those undergoing head and neck surgery without free-flap reconstruction ("non-free-flap subgroup").

FIGURE 3 Comparison of Brier scores between the overall cohort, free-flap reconstruction subgroup, and non-free-flap reconstruction subgroup. A Brier score <0.01 suggests strong predictive value.

TABLE 1 Description of studies meeting the inclusion criteria.