Diagnostic and prognostic performance of the SAFE score in non‐alcoholic fatty liver disease

The steatosis‐associated fibrosis estimator (SAFE) score was developed to detect clinically significant liver fibrosis in patients with NAFLD in the United States. We compare the performance of the SAFE score and other non‐invasive tests to diagnose liver fibrosis and to correlate the scores with liver‐related outcomes in patients with NAFLD in Hong Kong.


| INTRODUCTION
Non-alcoholic fatty liver disease (NAFLD) is currently the most common cause of chronic liver disease and is highly prevalent in patients with obesity and type 2 diabetes. 1,2 NAFLD represents a disorder spanning from simple steatosis (non-alcoholic fatty liver, NAFL) to non-alcoholic steatohepatitis (NASH), which is the aggressive form of NAFLD associated with increased liver-related complications.
[5] While liver histology remains the gold standard in determining liver fibrosis, non-invasive clinical tools, including liver stiffness measurement (LSM) devices, clinical prediction scores and proprietary biomarkers, have recently been developed to detect liver fibrosis in patients with NAFLD. The SAFE score, the latest addition to the list, was designed to have a high negative predictive value (NPV) in primary care, to aid in triaging patients who are at the lowest risk of having clinically significant fibrosis (F0-1 vs. ≥F2). 6 In contrast, earlier scores were developed for advanced fibrosis (≥F3); examples include the fibrosis-4 (FIB-4) index, 7 NAFLD fibrosis score (NFS) 8 and aspartate aminotransferase (AST)-to-platelet ratio index (APRI). 9 When it comes to these non-invasive tests (NITs) for liver fibrosis in NAFLD, one size may not fit all, and the SAFE score needs to be validated in additional settings and its performance assessed against existing non-invasive scores and LSM. In particular, in this study, taking advantage of existing cohorts of patients, we examine the potential utility of the SAFE score in identifying patients with liver fibrosis with high sensitivity (i.e. few false-negative tests) for referral for LSM and specialty consultation, or in predicting future liver complications. Secondarily, we conduct a subgroup analysis in patients with diabetes in light of a recent multicentre French study suggesting that the existing fibrosis scores are less accurate in patients with type 2 diabetes.

| Clinical assessment and data collection
In the biopsy cohort, comprehensive clinical assessment was performed.
Anthropometric and clinical data were recorded at the time of the liver biopsy, including age, sex, weight, height, waist circumference and hip circumference. Blood samples were taken after at least 8 hours of fasting for complete blood cell count, international normalized ratio (INR) of the prothrombin time, fasting plasma glucose, haemoglobin A1c, total cholesterol, high-density lipoprotein (HDL) cholesterol, low-density lipoprotein (LDL) cholesterol, triglycerides, ALT, AST, gamma-glutamyltransferase (GGT), albumin and globulin. In the territory-wide cohort, demographic data, BMI and laboratory data were retrieved from CDARS. The baseline date was defined as the date of first diagnosis of NAFLD.
Patients who had NAS ≥4 with at least 1 point in each component and fibrosis stage ≥F2 were considered to have fibrotic NASH.

| Liver stiffness measurement
In patients from the biopsy cohort, LSM and controlled attenuation parameter (CAP) were performed within 1 week before liver biopsy using a vibration-controlled transient elastography (VCTE) device (FibroScan 502 machine and FibroScan 502 Touch machine, Echosens, Paris, France) by experienced operators according to the training and instructions by the manufacturer. 15 Patients with ≥10 valid LSM and CAP measurements and an interquartile range-to-median ratio of the measurements of ≤.3 were considered to have valid results. The median measurement expressed in kilopascals (kPa) was reported as liver stiffness, and CAP was expressed in dB/m.
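The validity rule above can be illustrated with a minimal sketch. It assumes the individual valid readings of one examination are available as a list; the function name `summarize_vcte` and the use of Python's inclusive quartile method are assumptions made for illustration, as the device software reports the median and interquartile range itself and may compute quartiles differently.

```python
from statistics import median, quantiles


def summarize_vcte(readings_kpa, min_valid=10, max_iqr_ratio=0.3):
    """Summarize one VCTE examination (hypothetical helper).

    readings_kpa: valid liver stiffness readings in kPa.
    Returns (median_kpa, iqr_to_median_ratio, is_valid), where is_valid
    applies the two criteria quoted in the text: at least 10 valid
    readings and an IQR-to-median ratio of at most 0.3.
    """
    if len(readings_kpa) < 2:
        raise ValueError("at least two readings are needed to compute an IQR")
    med = median(readings_kpa)
    # Assumption: quartiles by inclusive linear interpolation.
    q1, _, q3 = quantiles(readings_kpa, n=4, method="inclusive")
    ratio = (q3 - q1) / med
    is_valid = len(readings_kpa) >= min_valid and ratio <= max_iqr_ratio
    return med, ratio, is_valid


# Hypothetical examination with ten tightly clustered readings.
exam = [8.0, 8.5, 9.0, 9.5, 10.0, 8.2, 9.1, 8.8, 9.3, 9.0]
median_kpa, iqr_ratio, valid = summarize_vcte(exam)
```

The same function applies to CAP readings, with values in dB/m instead of kPa.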

| Liver-related events
In the territory-wide cohort, liver-related events were defined as a composite endpoint of HCC and cirrhotic complications (ascites, spontaneous bacterial peritonitis, variceal bleeding, hepatic encephalopathy, hepatorenal syndrome), liver transplantation and/or liver-related mortality based on ICD-9-CM diagnosis codes and procedure codes (Table S1).

| Data analysis
In both the biopsy and territory-wide cohorts, laboratory-based fibrosis tests were calculated according to published formulas: SAFE score was calculated with age (years), BMI (kg/m²), presence of diabetes (YES/NO), AST (U/L), ALT (U/L), globulin (g/dL) and platelet count (10⁹/L). 16 FIB-4 index was calculated with age (years), platelet count (10⁹/L), AST (U/L) and ALT (U/L). 17 APRI was calculated with platelet count (10⁹/L) and AST (U/L). 9 NFS was calculated with age (years), BMI (kg/m²), presence of impaired fasting glucose or diabetes (YES/NO), platelet count (10⁹/L), AST (U/L), ALT (U/L) and albumin (g/dL). 8 AGILE 3+ score was calculated with age (years), sex (male/female), AST (U/L), ALT (U/L), platelet count (10⁹/L), presence of diabetes (YES/NO) and LSM (kPa). 18 Fibrotic NASH index (FNI) was calculated with AST (U/L), haemoglobin A1c (%) and HDL (mg/dL). 19 FibroScan-AST (FAST) score was calculated with LSM (kPa), CAP (dB/m) and AST (U/L). 20 To compare the diagnostic and prognostic accuracy of non-invasive tests, we used the published cut-offs of the following tests: SAFE score (low cut-off, 0; high cut-off, 100), FIB-4 (low cut-off, .96; high cut-off, 1.45), APRI (low cut-off, .43; 23] For descriptive analyses, continuous variables were presented as mean ± standard deviation (SD) or median (interquartile range

In the territory-wide cohort, the proportion of missing lab values was less than 20%, and missing values were imputed by stochastic regression imputation. The Kaplan-Meier method was used to estimate the cumulative incidence of liver-related events; the log-rank test was performed to compare the cumulative incidence in different patient groups. The performance of the non-invasive tests in predicting liver-related events was measured by the time-dependent AUROC and compared with non-parametric bootstrapping. 25 All statistical tests were two-sided. A p value < .05 was considered statistically significant for all tests. All analyses were performed with IBM SPSS 26 and R software Version 4.0.0.
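As an illustration of how such scores and dual cut-offs are applied, the sketch below computes FIB-4 and APRI from their widely published formulas and assigns a risk group from a pair of cut-offs. The function names, the default AST upper limit of normal of 40 U/L and the handling of values exactly on a cut-off are assumptions for illustration, not taken from this study. The SAFE score coefficients are not reproduced here and should be taken from the original publication.

```python
import math


def fib4(age_years, ast_u_l, alt_u_l, platelets_10e9_l):
    """FIB-4 index: (age x AST) / (platelet count x sqrt(ALT))."""
    return (age_years * ast_u_l) / (platelets_10e9_l * math.sqrt(alt_u_l))


def apri(ast_u_l, platelets_10e9_l, ast_uln_u_l=40.0):
    """APRI: (AST / upper limit of normal) x 100 / platelet count.

    Assumption: the upper limit of normal defaults to 40 U/L; the
    laboratory-specific value should be used in practice.
    """
    return (ast_u_l / ast_uln_u_l) * 100.0 / platelets_10e9_l


def risk_group(score, low_cutoff, high_cutoff):
    """Assign low, intermediate or high risk from dual cut-offs.

    Assumption: values exactly on a cut-off fall in the intermediate zone.
    """
    if score < low_cutoff:
        return "low"
    if score > high_cutoff:
        return "high"
    return "intermediate"


# Hypothetical patient: 50 years, AST 40 U/L, ALT 25 U/L, platelets 200 x 10^9/L.
patient_fib4 = fib4(50, 40, 25, 200)
patient_apri = apri(40, 200)
# Cut-offs as quoted in the text for FIB-4 (.96 and 1.45) and SAFE (0 and 100).
fib4_group = risk_group(patient_fib4, 0.96, 1.45)
safe_group = risk_group(50, 0, 100)  # hypothetical SAFE score of 50
```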

| Characteristics of patients in the biopsy cohort
Of 1268 patients undergoing liver biopsies during the study period, 470 with NAFLD were eligible to be included in the analysis (Figure S1A). The first set of columns of Table 1 presents data for the entire cohort. The mean age of the patients was 52.1 ± 11.9 years, 259 (55.1%) were male and 265 (62.4%) had type 2 diabetes. There were 221 (47%) patients with significant fibrosis (≥F2), whose characteristics were significantly different from those with F0-1 in the majority of demographic, laboratory and histological parameters in the expected direction.
Of the non-invasive tests evaluated, missing data were fewest for VCTE (n = 39), followed by FIB-4 (n = 150), APRI (n = 150), NFS (n = 167), SAFE (n = 170) and Agile 3+ (n = 173). A statistically significant difference was noted between subjects with ≥F2 and F0-1 for all NITs. The second set of columns in Table 1 presents data in 279 individuals who had complete data for all of the NITs. The subset appeared overall representative of the entire data set. S3)

| Diagnostic performance of SAFE score in the biopsy cohort
In the detection of fibrotic NASH (NAS ≥4 and ≥F2), SAFE had a significantly higher AUROC than FIB-4, APRI and FNI, and a lower AUROC than Agile 3+, while it was non-significantly inferior to LSM. Finally, for detecting advanced fibrosis (≥F3), SAFE had a significantly higher AUROC than APRI and a lower AUROC than Agile 3+, while it was non-significantly superior to FIB-4 and NFS and non-significantly inferior to LSM (Table 2). Figure S4 shows the correlation between the four blood-based NITs and fibrosis stage. The SAFE score had a higher PPV or NPV and the smallest indeterminate zone for F ≥ 2 and F ≥ 3 among all non-invasive scores at dual cut-offs of 85% sensitivity and 95% specificity (Table 2).
Table 3 shows the performance metrics of non-invasive tests for the diagnosis of significant fibrosis. Of the two thresholds commonly employed to rule out and rule in the diagnosis, respectively, the higher ones are more relevant for this analysis of patients evaluated at a specialty practice. Based on available histological data, approximately 45% of the subjects had significant fibrosis. The proportion of subjects with marker values higher than the rule-in threshold ('test positive') was the highest for VCTE (45%), followed by SAFE (42%), FIB-4 (28%) and APRI (3%). Note that not all individuals with marker values higher than the threshold had significant fibrosis. Among the blood markers, sensitivity, namely the proportion with marker values higher than the threshold among those with significant fibrosis, was the highest with SAFE (67%), followed by FIB-4 (51%) and APRI (5%). Conversely, PPV was the highest for FIB-4 (82%), followed by SAFE (73%) and APRI (70%).
For comparison, the sensitivity and PPV for VCTE based on the threshold of 9.0 kPa were 71% and 74% respectively. Table S2 shows the performance metrics of SAFE score and FAST score for the diagnosis of fibrotic NASH.
In ruling out significant fibrosis, SAFE had the highest NPV (79%), while APRI had the best specificity (72%), among the blood-based markers. The NPV and specificity for VCTE were 87% and 29% at the threshold value of 5.8 kPa. The FIB-4 and SAFE scores classified fewer patients (22% and 26%, respectively) into the indeterminate zone than VCTE and APRI (35% and 39%, respectively).
The results did not change appreciably when the analysis was restricted to 279 patients with data for all the non-invasive tests (data not shown).

| Subgroup analysis of patients with and without diabetes
Table S3 presents subgroup analyses in patients with and without type 2 diabetes. There were 284 patients with diabetes, of whom 153 (54%) had complete data, whereas 126 of 186 patients (68%) without diabetes did. Although the smaller sample size limits the power to detect differences, there was no difference among the NITs, including LSM, among patients without diabetes. Among patients with diabetes, significant differences were found between SAFE and NFS for the diagnosis of clinically significant fibrosis and fibrotic NASH, and between SAFE and Agile 3+ for the diagnosis of advanced fibrosis.

| Baseline characteristics of patients in the territory-wide cohort
From the CDARS data, we identified 30 943 patients who met the diagnostic criteria for NAFLD from January 2000 to July 2021 (Figure S1B). A large proportion of the patients were excluded because of missing data or liver-related events occurring before a meaningful duration of follow-up (Figure S1B). Eventually, 4603 patients were included in the analysis. The mean age of the patients was 56.1 ± 13.6 years and the mean BMI was 28.0 ± 4.9 kg/m² (Table S4). Table S5 examines the representativeness of the patients included in the analysis vis-à-vis the entire cohort. Because of the large sample size, small p-values may not necessarily connote biologically or clinically important differences. With that caveat, diabetes and cirrhosis were more common, while the follow-up duration was longer, among patients included in the analysis.

| Performance of SAFE score and other non-invasive tests for predicting liver-related events
Based on suggested cut-offs of SAFE score (0 and 100 points), 854 (18.6%), 1596 (34.6%) and 2153 (46.8%) were in the low-, intermediate- and high-risk groups respectively. Six (.7%), 15 (.9%) and 59 (2.7%) developed liver-related events in the three groups respectively. Figure 1A depicts 5-, 10- and 15-year incidence of liver-related events. The statistically significant separation among the incidence curves was mainly attributable to the high-risk group (15-year incidence of 9.4% vs. 2.1% and 1.2% in the intermediate- and low-risk groups respectively). The same analysis was replicated for FIB-4 and APRI in Figure 1B,C respectively. The pattern for FIB-4 was similar to that of SAFE, with the high-risk group having a much higher incidence than the intermediate- and low-risk groups. The separation was more even yet less pronounced with APRI.

FIGURE 1 Cumulative incidence of liver-related events in patients stratified by risk groups as defined by SAFE (A), FIB-4 index (B) and APRI (C) in the territory-wide cohort.
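Cumulative incidence curves of this kind are Kaplan-Meier estimates, in which patients without an event are censored at the end of their follow-up. The following is a minimal sketch of the estimator on hypothetical toy data; the function name and the data are illustrative assumptions, and the study's own analysis was run in SPSS and R rather than with this code.

```python
def km_cumulative_incidence(times, events):
    """Kaplan-Meier cumulative incidence, 1 - S(t), at each event time.

    times:  follow-up time per patient (e.g. in years).
    events: 1 if a liver-related event occurred at that time,
            0 if the patient was censored.
    Returns a list of (time, cumulative_incidence) tuples.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, 1.0 - survival))
        at_risk -= n_leaving
    return curve


# Hypothetical follow-up of six patients (not study data).
toy_times = [2, 3, 3, 5, 8, 10]
toy_events = [1, 0, 1, 0, 1, 0]
toy_curve = km_cumulative_incidence(toy_times, toy_events)
```

Running the function once per risk group yields one curve per group, which is how a stratified figure of this kind is assembled.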
5-, 10- and 15-year incidence of liver events. Among patients who had liver-related events at 5 years, SAFE score could predict 84.9% of patients accurately, compared to 40.9% for FIB-4 and 27.2% for APRI.
Table 5 includes the time-dependent AUROCs (95% CI) for the prediction of liver-related events at 5, 10 and 15 years for SAFE, FIB-4, NFS and APRI. By and large, APRI had lower AUROCs compared to the other three, for which there was no significant difference. Figure S6 shows that SAFE score had better calibration than other NITs for 10-year liver events. When the analysis was repeated separately for patients with and without type

Fibrosis is the common final pathway of all chronic liver diseases, which correlates with liver-related complications.

otherwise, there would be too many patients who are missed. In addition, it should also have a reasonable positive predictive value, or there will be too many patients to be referred unnecessarily.
The data in Table 3 show that SAFE has the highest sensitivity, and thus the lowest false-negative rate, among the laboratory-based NITs. Its PPV is respectable at 73%, indicating that approximately three of four patients referred would have fibrotic liver disease. These data suggest that the SAFE score, consisting of clinical data routinely available in primary care practices, may be better suited for initial evaluation of patients with NAFLD in primary care than the existing NITs.
Due to the high prevalence of advanced fibrosis among patients with type 2 diabetes, the European Association for the Study of the Liver, the European Association for the Study of Diabetes and the American Diabetes Association recommend liver fibrosis evaluation in this high-risk population. 27,28 Recently, a group of French investigators demonstrated that the AUROCs of several simple fibrosis scores for advanced fibrosis were lower in patients with type 2 diabetes. However, this was probably confounded by age, as the performance of the scores was similar in patients with and without diabetes after adjusting for age. 10 In our study, there were no statistical differences in the AUROCs of the SAFE score and its comparators for significant fibrosis between patients with and without type 2 diabetes.
An increasing body of data has shown that NITs such as FIB-4 and APRI predict liver-related complications similarly to liver biopsy. 29 In our study, with a median follow-up of 6.8 years, the incidence of liver-related events in NAFLD patients correlated with the estimated risk of liver fibrosis, to a degree similar to past biopsy-based studies (6%-56%). 29 As a prognostic indicator, NITs calibrated to detect F3 may have an advantage over those designed for F2, as patients with F3-4 fibrosis are much closer to an event than those with earlier-stage fibrosis. Another factor that may affect the prognostic performance of a model is the weight given to the age variable. Age is a powerful predictor of morbidity and mortality, which may explain why SAFE, FIB-4 and NFS, all of which contain age, had better AUROCs than APRI, which lacks the age variable.
In this study, the AUROCs of the SAFE score in predicting liver-related events ranged from .69 to .76. Among patients with high SAFE score (>100), 2.7% of them developed hepatic events,

TABLE 5 Time-dependent area under the receiver-operating characteristics curves (AUROC) of non-invasive tests in 4603 patients from territory-wide cohort for the prediction of liver-related events at 5, 10 and 15 years (p-value by non-parametric bootstrapping).

2 | METHODS
2.1 | Data source and patients
2.1.1 | Biopsy cohort
This is a retrospective cohort study involving two data sets. The first data set was derived from adult patients (age ≥18 years) who underwent a liver biopsy and were prospectively enrolled in a registry with informed written consent at the Prince of Wales Hospital from July 2006 to November 2021. Patients with multiple metabolic risk factors, abnormal investigation results suggestive of advanced liver disease or diagnostic uncertainty were recommended to undergo liver biopsy. Eligible patients had biopsy-proven NAFLD defined by >5% liver steatosis. The exclusion criteria included the presence of other liver diseases (hepatitis B virus [HBV] infection, hepatitis C virus [HCV] infection and histologic features of an alternative diagnosis), a secondary cause of liver steatosis including excessive alcohol consumption (>30 g/day in men or >20 g/day in women), HIV infection or history of hepatocellular carcinoma (HCC) or other hepatobiliary malignancies, and liver biopsy length <10 mm (Figure S1A). The study was approved by the Ethics Committee of the Chinese University of Hong Kong.
2.1.2 | Territory-wide cohort
The second set represented a territory-wide retrospective cohort retrieved from the Clinical Data Analysis and Reporting System (CDARS) under the management of the Hospital Authority, Hong Kong. CDARS is an electronic health database that holds patients' demographics, death, diagnoses, medical procedures, drug prescription and dispensing history, and laboratory results from all public hospitals and clinics in Hong Kong. The diagnoses were identified by International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes in CDARS. Data from CDARS have been employed in prior territory-wide studies on NAFLD. 11-13 All adult patients (age ≥18 years) with NAFLD (ICD-9-CM diagnosis code 571.8) first diagnosed from January 2000 to July 2021 were eligible. The exclusion criteria were as follows: positive viral or serological markers of chronic HBV or HCV infection, and/or use of antiviral therapy for HBV and/or HCV; no available HBV serology results; coinfection with HIV; excessive use of alcohol based on the alcohol consumption record or ICD-9-CM diagnosis codes; concomitant hepatobiliary diseases; missing key data more than 20% of the whole cohort (BMI or laboratory parameters necessary to calculate SAFE and other scores); and occurrence of liver-related events at or within 6 months of baseline (Figure S1B).

[IQR]) as appropriate and compared using the Student's t-test and Mann-Whitney U-test in two groups. Categorical variables were compared using the Chi-squared test and Fisher's exact test. The trend of non-invasive test values across fibrosis stage was assessed by the Jonckheere-Terpstra test. The performance of the non-invasive tests was assessed by calibration and discrimination in both the biopsy cohort and the territory-wide cohort. Calibration was assessed by calibration plots. Discrimination in detecting significant fibrosis, advanced fibrosis and fibrotic NASH was measured by the area under the receiver operating characteristic curve (AUROC) and compared with the DeLong test or the Hanley & McNeil test. 24 Sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) were calculated to evaluate the diagnostic accuracy of the non-invasive tests and compared with the Chi-squared test and Fisher's exact test. Specifically, diagnostic performance of NITs was compared at cut-off values associated with sensitivity of .85 and specificity of .95 in the biopsy cohort.
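The discrimination and accuracy measures named above can be sketched as follows. This is an illustrative implementation on hypothetical data, not the study's code: the function names are assumptions, a test is treated as positive when the score is strictly higher than the threshold, and the AUROC is computed through its equivalence to the Mann-Whitney statistic, without the confidence intervals or the DeLong comparison used in the analysis.

```python
def diagnostic_metrics(scores, has_disease, threshold):
    """Sensitivity, specificity, PPV and NPV at one threshold.

    Assumption: a score strictly above the threshold is 'test positive'.
    A metric with a zero denominator is returned as NaN.
    """
    tp = fp = tn = fn = 0
    for score, diseased in zip(scores, has_disease):
        if score > threshold:
            if diseased:
                tp += 1
            else:
                fp += 1
        elif diseased:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def auroc(scores, has_disease):
    """AUROC as the probability that a diseased patient scores higher
    than a non-diseased patient, counting ties as one half."""
    pos = [s for s, d in zip(scores, has_disease) if d]
    neg = [s for s, d in zip(scores, has_disease) if not d]
    if not pos or not neg:
        raise ValueError("both diseased and non-diseased patients are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores and biopsy results (True = significant fibrosis).
toy_scores = [10, 20, 30, 40, 50, 60]
toy_fibrosis = [False, False, True, False, True, True]
toy_metrics = diagnostic_metrics(toy_scores, toy_fibrosis, threshold=25)
toy_auroc = auroc(toy_scores, toy_fibrosis)
```

With dual cut-offs, the function is applied twice: at the low cut-off to read off NPV and sensitivity for ruling out, and at the high cut-off to read off PPV and specificity for ruling in.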

TABLE 4 Diagnostic performance of SAFE score using two cut-offs for the prediction of liver-related events at 5, 10 and 15 years (N = 4603).

Table S1 lists the ICD-9-CM codes used in the inclusion and exclusion criteria.

Table 2 and Figure S2 compare the AUROCs of the markers. In

Characteristics Entire NAFLD cohort (n = 470) Subset with complete data (n = 279)
TABLE 2 Diagnostic performance of different scores and non-invasive tests for the diagnosis of significant fibrosis, advanced fibrosis in 279 patients and fibrotic NASH in 270 patients^a with biopsy-proven NAFLD and complete data for all different non-invasive tests.
F ≥ 2 target: F ≥ 2 (n = 124) vs. F0-1 (n = 155)

Table 4 presents time-dependent sensitivity, specificity, PPV and NPV for SAFE in predicting

Fibrotic NASH target Fibrotic NASH (n = 80) vs. no fibrotic NASH (n = 190)
For sensitivity, specificity, PPV and NPV, numbers in brackets represent 95% confidence intervals. Abbreviations: AUROC, area under the receiver-operating characteristics curves; NPV, negative predictive value; PPV, positive predictive value.
^a Nine patients with missing high-density lipoprotein for FNI.

Diagnostic performance of blood fibrosis tests and VCTE using two cut-offs for the diagnosis of significant fibrosis in all patients with biopsy-proven NAFLD. Numbers in brackets for AUROC, sensitivity, specificity, PPV and NPV represent 95% confidence intervals. NFS is not included in this table because it does not have published cut-offs for significant fibrosis. Abbreviations: NAFLD, non-alcoholic fatty liver disease; NPV, negative predictive value; PPV, positive predictive value.