Long‐term Results Comparing Cervical Disc Arthroplasty to Anterior Cervical Discectomy and Fusion: A Systematic Review and Meta‐Analysis of Randomized Controlled Trials

Objective: Whether cervical disc arthroplasty (CDA) is superior to anterior cervical discectomy and fusion (ACDF) remains controversial, especially in relation to long‐term results. The present study aimed to evaluate the long‐term safety and efficiency of CDA and ACDF for cervical disc disease.
Methods: We performed this study according to the Cochrane methodology. An extensive search was undertaken in PubMed, Embase, and Cochrane databases up to 1 June 2019 using the following keywords: “anterior cervical fusion,” “arthroplasty,” “replacement,” and “artificial disc”. RevMan 5.3 (Cochrane, London, UK) was used to analyze data. Safety and efficiency outcome measures included the success rate, functional outcome measures, adverse events (AE), adjacent segment degeneration (ASD), secondary surgery, and patients’ satisfaction and recommendation rates. The odds ratio (OR) and mean difference (MD) with 95% confidence interval (CI) were used to evaluate dichotomous and continuous variables, respectively. Statistical significance was set at P < 0.05.
Results: A total of 11 randomized controlled trials with 3505 patients (CDA/ACDF: 1913/1592) were included in this meta‐analysis. Compared with ACDF, CDA achieved significantly higher overall success (2.10, 95% CI [1.70, 2.59]), neck disability index (NDI) success (1.73, 95% CI [1.37, 2.18]), neurological success (1.65, 95% CI [1.24, 2.20]), patients’ satisfaction (2.14, 95% CI [1.50, 3.05]), and patients’ recommendation rates (3.23, 95% CI [1.79, 5.80]). Functional outcome measures such as visual analog score neck pain (−5.50, 95% CI [−8.49, −2.52]) and arm pain (−3.78, 95% CI [−7.04, −0.53]), the Short Form‐36 physical component score (SF‐36 PCS) (1.93, 95% CI [0.53, 3.32]), and the Short Form‐36 mental component score (SF‐36 MCS) (2.62, 95% CI [0.95, 4.29]) revealed superiority in the CDA group. CDA also achieved a significantly lower rate of symptomatic ASD (0.46, 95% CI [0.34, 0.63]), total secondary surgery (0.50, 95% CI [0.29, 0.87]), secondary surgery at the index level (0.46, 95% CI [0.29, 0.74]), and secondary surgery at the adjacent level (0.37, 95% CI [0.28, 0.49]). However, no significant difference was found in radiological success (1.35, 95% CI [0.88, 2.08]), NDI score (−2.88, 95% CI [−5.93, 0.17]), total reported AE (1.14, 95% CI [0.92, 1.42]), serious AE (0.89, 95% CI [0.71, 1.11]), device/surgery‐related AE (0.90, 95% CI [0.68, 1.18]), radiological superior ASD (0.63, 95% CI [0.28, 1.43]), inferior ASD (0.45, 95% CI [0.19, 1.11]), and work status (1.33, 95% CI [0.78, 2.25]). Furthermore, subgroup analysis showed different results between US and non‐US groups.
Conclusion: Our study provided further evidence that, compared to ACDF, CDA had a higher long‐term clinical success rate and better functional outcome measurements, and resulted in less symptomatic ASD and fewer secondary surgeries. However, worldwide multicenter RCT with long‐term follow up are still needed for further evaluation in the future.


Introduction
Anterior cervical discectomy and fusion (ACDF) has been viewed as the gold standard procedure for cervical disc disease (CDD), including radiculopathy and myelopathy. A recent survey revealed that 84.3% of surgeons performed ACDF as the standard technique for CDD 1. Even though successful clinical outcomes can be achieved with ACDF, postoperative complications such as pseudoarthrosis or nonunion, instrument failure, and adjacent segment degeneration (ASD) have been the greatest concerns [2][3][4]. Cervical fusion can lead to loss of range of motion at the index level and shift load to the adjacent level, thereby accelerating ASD 2,3,5. Hilibrand et al. reported that approximately 2.9% of patients who undergo anterior interbody fusion develop ASD requiring cervical intervention each year 2. Thus, spinal surgeons have been attempting to find an alternative procedure that avoids these complications associated with ACDF.
A motion-preserving procedure, cervical disc arthroplasty (CDA), seems to be a good alternative. CDA was designed to preserve motion, restore the physiologic biomechanical properties of the cervical spine, and alleviate adjacent-level loads, thereby reducing or eliminating the risk of developing ASD 6. Clinical data showed that preoperative motion could be maintained in the long run following CDA 7. Promisingly, recent studies have shown that CDA is cost-effective and comparable to ACDF at long-term follow-up [8][9][10][11]. However, some disadvantages of CDA cannot be overlooked, such as heterotopic ossification, implant failure, and bone loss [12][13][14]. In addition, the revision burden of CDA was reported to be two times higher than that of ACDF 15.
In the past 20 years, a series of randomized controlled trials (RCT) have been conducted; however, the reported results are inconsistent and highly variable. Although several systematic reviews have been performed, researchers have failed to reach an agreement owing to varied criteria 5,[16][17][18][19][20][21][22][23][24]. Moreover, long-term results have not yet been pooled in a comprehensive meta-analysis. To our knowledge, this is the first study comparing CDA to ACDF with a special focus on long-term safety and efficiency. The conclusions drawn from this study could provide solid evidence for the future application of CDA.
This study was approved by the Ethics Committee of The Second Xiangya Hospital of Central South University.

Literature Search Strategy
We followed the Cochrane methodology guidelines to perform this meta-analysis and searched PubMed, Embase, and the Cochrane Central Register of Controlled Trials (CCRCT) databases up to 1 June 2019. The keywords "anterior cervical fusion," "arthroplasty," "replacement," and "artificial disc" combined with "and/or" were used to identify any relevant studies.

Inclusion and Exclusion Criteria
The inclusion criteria were as follows: (i) patients ≥18 years old with symptomatic CDD presenting with radiculopathy and/or myelopathy; (ii) participants were treated with either CDA or ACDF; (iii) comparison was performed between CDA and ACDF; (iv) at least one efficiency and safety outcome measurement was available; and (v) prospective RCT with a follow up ≥5 years.
Articles that met the following characteristics were excluded: (i) reviews, case reports or series, editorials, conference abstracts, and retrospective studies; (ii) duplicated data publications from the same RCT; (iii) partial results with insufficient data; and (iv) non-English publications.

Literature Screening
Literature screening was performed by two independent investigators (Tu, ZM and Wang, QL). Any disagreement was discussed with another author (Hu, P) to reach consensus. After excluding duplicates, literature selection was carried out according to the inclusion and exclusion criteria based on title and abstract. Then, the full-text articles were screened in detail. All RCT that compared the long-term efficiency and safety of CDA and ACDF for CDD were included.

Quality Assessment of the Included Studies
Quality assessment was performed using the criteria recommended by the Cochrane Back Review Group 25. Twelve items were assessed: four for selection bias, four for performance bias, two for attrition bias, one for detection bias, and one for reporting bias. Articles scoring at least 6 of these 12 items were considered to be at low risk of bias. The last bias assessed was "Other," defined as any potential bias not detected using the previous criteria.
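The scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the domain names, the `rate_study` helper, and the example study are hypothetical, and the text specifies only the "low risk" threshold, so anything below it is simply labelled "not low risk" here.

```python
from typing import Dict

# Number of items per bias domain, as counted in the text (12 in total).
ITEM_COUNTS: Dict[str, int] = {
    "selection": 4,
    "performance": 4,
    "attrition": 2,
    "detection": 1,
    "reporting": 1,
}
LOW_RISK_THRESHOLD = 6  # "at least 6 of these 12"


def rate_study(items_met: Dict[str, int]) -> str:
    """Classify one study from the number of items it satisfies per domain."""
    for domain, met in items_met.items():
        if domain not in ITEM_COUNTS:
            raise ValueError(f"unknown domain: {domain}")
        if not 0 <= met <= ITEM_COUNTS[domain]:
            raise ValueError(f"invalid item count for {domain}: {met}")
    score = sum(items_met.values())
    return "low risk" if score >= LOW_RISK_THRESHOLD else "not low risk"


# Hypothetical study: unblinded (few performance items met) but otherwise sound.
example = {"selection": 3, "performance": 1, "attrition": 2, "detection": 0, "reporting": 1}
print(rate_study(example))
```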

Data Extraction
Data extraction was performed as follows: (i) general characteristics such as first author, year of publication, clinical trial registration number (NCT), enrolled patients, follow-up rate, age, sex, surgical levels, type of prosthesis, and follow-up duration were extracted; and (ii) outcome measures, including clinical success rate (overall success, NDI success, neurological success, and radiological success), functional outcome measurements (NDI score, visual analog score [VAS] neck pain and arm pain, and SF-36 PCS and MCS), AE (total reported AE, serious AE, and device/surgery-related AE), ASD (symptomatic ASD, radiological superior or inferior ASD), secondary surgery (total secondary surgery, secondary surgery at the index level and at the adjacent level), work status, and patients' satisfaction and recommendation rates were extracted. This task was performed by two independent investigators (Tu, ZM and Wang, QL), who extracted the data and discussed any disagreement with a third investigator (Hu, P) to reach consensus. Data-extracting software was used to obtain data from figures when the original data were not available 26.

Statistical Analysis
RevMan 5.3 (Cochrane, London, UK) was used to pool the extracted data. The odds ratio (OR) and mean difference (MD) with 95% confidence intervals (CI) were used to evaluate dichotomous and continuous variables, respectively. Heterogeneity was assessed using the χ² test and the I² statistic. A fixed effects model was used when I² < 50%; otherwise, a random effects model was used. Sensitivity analysis was performed by comparing the results of the two effects models. If the statistical significance changed, the leave-one-out method 27 and subgroup analysis were performed to find the origin of heterogeneity. Funnel plots were used to assess publication bias. A P-value of less than 0.05 was considered statistically significant.
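To make the pooling and model-selection rule concrete, the following Python sketch pools log odds ratios with inverse-variance fixed-effect weights, computes I² from Cochran's Q, and applies the I² < 50% rule. It is a minimal illustration, not the RevMan computation itself: RevMan also offers Mantel-Haenszel weighting for dichotomous data, the function names are hypothetical, and the 2x2 tables are invented for demonstration rather than taken from the included trials.

```python
import math
from typing import Sequence, Tuple


def log_odds_ratio(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Return (log OR, variance) for one trial's 2x2 table.

    a, b: events / non-events in the CDA arm
    c, d: events / non-events in the ACDF arm
    """
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def pool_fixed(effects: Sequence[float], variances: Sequence[float]) -> Tuple[float, float]:
    """Inverse-variance fixed-effect pooled estimate and its standard error."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


def i_squared(effects: Sequence[float], variances: Sequence[float]) -> float:
    """I^2 (in percent), derived from Cochran's Q around the fixed-effect estimate."""
    weights = [1.0 / v for v in variances]
    pooled, _ = pool_fixed(effects, variances)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0 or df <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def choose_model(i2: float) -> str:
    """Rule stated in the text: fixed effects when I^2 < 50%, otherwise random."""
    return "fixed" if i2 < 50.0 else "random"


# Hypothetical 2x2 tables (successes / failures per arm), for illustration only.
tables = [(80, 20, 65, 35), (150, 50, 120, 80), (45, 15, 40, 20)]
effects, variances = zip(*(log_odds_ratio(*t) for t in tables))
pooled, se = pool_fixed(effects, variances)
low, high = math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se)
i2 = i_squared(effects, variances)
print(f"OR {math.exp(pooled):.2f}, 95% CI [{low:.2f}, {high:.2f}], "
      f"I2 = {i2:.0f}%, model = {choose_model(i2)}")
```

Pooling is done on the log scale and exponentiated at the end, which is why an OR whose 95% CI excludes 1 corresponds to P < 0.05.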

Literature Review
Initial database searching identified 1954 articles (PubMed: 650, Embase: 1020, CCRCT: 284); the detailed literature screening is described in the flow diagram in Fig. 1. A total of 814 studies were removed because they were duplicates, 1076 studies were excluded based on their titles and abstracts, and 43 studies were excluded for other reasons. As a result, 21 studies 28-48 were included for further evaluation. Among them, 2 studies 45,47 reported partial results of multicenter RCT and 8 studies [39][40][41][42][43][44]46,48 duplicated previously published data. Ultimately, 11 articles 28-38 involving 3505 patients (CDA/ACDF: 1913/1592) were included in this meta-analysis. There were 923 male and 990 female patients in the CDA group and 791 male and 801 female patients in the ACDF group. The mean age of the included populations ranged from 40 to 50 years in both groups. All patients suffered from radiculopathy and/or myelopathy caused by cervical disc disease with C3-4 to C6-7 involvement. The basic characteristics of the included studies and patients are summarized in Table 1. Of the 11 studies, 8 studies [28][29][30][32][33][34]36,38 compared single-level CDD, 1 study 31 compared two-level CDD, and 2 studies 35,37 compared both single-level and two-level CDD independently.
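The screening counts and patient totals reported above can be cross-checked with simple arithmetic. The short Python sketch below uses only the figures stated in this paragraph; the variable names are illustrative.

```python
# Records identified per database, as reported in the text.
identified = {"PubMed": 650, "Embase": 1020, "CCRCT": 284}
total_identified = sum(identified.values())

# Exclusions during screening: duplicates, title/abstract, other reasons.
full_text_candidates = total_identified - 814 - 1076 - 43

# Further exclusions: partial results (2) and duplicated data (8).
included_trials = full_text_candidates - 2 - 8

# Patients per arm (male + female), as reported in the text.
cda_patients = 923 + 990
acdf_patients = 791 + 801

print(total_identified, full_text_candidates, included_trials,
      cda_patients, acdf_patients, cda_patients + acdf_patients)
```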

Quality Assessment of the Included Studies
Methodological quality assessment of the 11 eligible studies is shown in Fig. 2. Nine studies 28,[30][31][32][33][34][35][36][37] were adequately randomized, but 1 study 29 did not provide detailed information on randomization, and 1 study 38 failed to achieve adequate randomization. Only 4 studies 30,32,33,35 provided a clear statement regarding allocation concealment. In addition, all included RCT 28-38 failed to blind patients and care providers, owing to the nature of this kind of surgical trial. The patients were informed immediately after surgery about the type of surgical procedure they had undergone, and care providers were aware of which kind of surgery was to be performed during surgery 28,[30][31][32][33][34][35][36][37]. Almost all the studies described the dropout rate, and 2 studies 28,29 with a follow-up rate below 70% were considered as having a high risk of attrition bias. All included studies scored above seven and were rated as having a low risk of bias.
The heterogeneity test showed that I² < 50% for overall success, NDI success, neurological success, radiological success, VAS neck pain and arm pain, SF-36 PCS and MCS, total reported AE, serious AE, device/surgery-related AE, symptomatic ASD, secondary surgery at the adjacent level, and patients' satisfaction and recommendation rates. This indicates low heterogeneity for these parameters, so a fixed effects model could be applied for the combined statistics. In contrast, the heterogeneity test showed I² > 50% for NDI score, radiological superior and inferior ASD, total secondary surgery, secondary surgery at the index level, and work status, which indicates significant or large heterogeneity. Therefore, a random effects model was applied for these combined statistics. The results of the heterogeneity test are summarized in Table 2.

Sensitivity Analysis
The combined OR or MD with 95% CI using fixed and random effects models for all outcome measures are shown in Table 3. The combined results were consistent between the two models for overall success, NDI success, neurological success, radiological success, VAS neck pain and arm pain, SF-36 PCS and MCS, total reported AE, serious AE, device/surgery-related AE, symptomatic ASD, total secondary surgery, secondary surgery at the index level and at the adjacent level, and patients' satisfaction and recommendation rates. This means that these results are stable and reliable. However, the two models gave different results for NDI score and for radiological superior and inferior ASD, indicating that these combined results were unreliable. Therefore, further analysis was performed.
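The model comparison described above can be illustrated with a short Python sketch that pools the same effects under a fixed effects model and under a DerSimonian-Laird random effects model (the random effects estimator used by RevMan), then checks whether statistical significance agrees. The helper names and the two-study example are hypothetical and chosen only to show how large between-study variance can change the conclusion.

```python
import math
from typing import Sequence, Tuple


def pool_fixed(effects: Sequence[float], variances: Sequence[float]) -> Tuple[float, float]:
    """Inverse-variance fixed-effect estimate and standard error."""
    w = [1.0 / v for v in variances]
    pooled = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    return pooled, math.sqrt(1.0 / sum(w))


def pool_random_dl(effects: Sequence[float], variances: Sequence[float]) -> Tuple[float, float, float]:
    """DerSimonian-Laird random-effects estimate, standard error, and tau^2."""
    w = [1.0 / v for v in variances]
    fixed, _ = pool_fixed(effects, variances)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star)), tau2


def is_significant(pooled: float, se: float, z_crit: float = 1.96) -> bool:
    """Two-sided test at the 0.05 level using a normal approximation."""
    return abs(pooled / se) > z_crit


def conclusion_stable(effects: Sequence[float], variances: Sequence[float]) -> bool:
    """True when both models agree on statistical significance."""
    f_est, f_se = pool_fixed(effects, variances)
    r_est, r_se, _ = pool_random_dl(effects, variances)
    return is_significant(f_est, f_se) == is_significant(r_est, r_se)


# Hypothetical effects (e.g. mean differences) from two discordant trials.
effects = [0.1, 0.9]
variances = [0.01, 0.01]
print(pool_fixed(effects, variances))
print(pool_random_dl(effects, variances))
print("stable:", conclusion_stable(effects, variances))
```

In this invented example the fixed effects model yields a narrow interval while the random effects interval is much wider, which is the kind of disagreement that flags an outcome for further analysis.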
Then, we performed sensitivity analysis based on the leave-one-out method 27. For NDI score, we found that the combined result changed significantly when removing the study from Hou et al. 32 or MacDowall et al. 35, with the P-value reduced from 0.06 to 0.02. Thus, we performed a subgroup analysis (Table 4) and found that the heterogeneity was 40% and 0% in the US and non-US subgroups, respectively, indicating that the heterogeneity originated from the studies being conducted in different regions. In addition, for radiological superior ASD, after we excluded the data from Radcliff et al. (2017) 37, I² decreased from 83% to 0%, and the statistical significance changed. For radiological inferior ASD, after we excluded the study from Phillips et al. 36, I² decreased from 78% to 28%, and the statistical significance also changed. This indicates that these two studies were the sources of heterogeneity for radiological superior and inferior ASD, respectively.
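A leave-one-out analysis of this kind can be sketched as follows: each study is omitted in turn, and the pooled estimate and I² are recomputed on the remainder. The sketch assumes inverse-variance fixed-effect pooling for simplicity; the study labels and effect sizes are hypothetical and do not correspond to the included trials.

```python
import math
from typing import Dict, Tuple


def pooled_and_i2(data: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Return (fixed-effect pooled estimate, I^2 in percent).

    data maps a study label to (effect, variance).
    """
    weights = {k: 1.0 / var for k, (_, var) in data.items()}
    total_w = sum(weights.values())
    pooled = sum(weights[k] * eff for k, (eff, _) in data.items()) / total_w
    q = sum(weights[k] * (eff - pooled) ** 2 for k, (eff, _) in data.items())
    df = len(data) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 and df > 0 else 0.0
    return pooled, i2


def leave_one_out(data: Dict[str, Tuple[float, float]]) -> Dict[str, Tuple[float, float]]:
    """Recompute the pooled estimate and I^2 with each study omitted in turn."""
    return {
        omitted: pooled_and_i2({k: v for k, v in data.items() if k != omitted})
        for omitted in data
    }


# Hypothetical studies: three concordant trials and one outlier ("D").
studies = {"A": (-3.0, 1.0), "B": (-3.0, 1.0), "C": (-3.0, 1.0), "D": (2.0, 1.0)}

print("all studies:", pooled_and_i2(studies))
for omitted, (estimate, i2) in leave_one_out(studies).items():
    print(f"without {omitted}: pooled = {estimate:.2f}, I2 = {i2:.0f}%")
```

A study whose removal drives I² toward zero, as the invented study "D" does here, is a candidate source of heterogeneity.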

Subgroup Analysis
First, we performed subgroup analysis based on region. The included studies were classified into US and non-US subgroups. The combined results for NDI score, symptomatic ASD, total secondary surgery, and secondary surgery at the index level and at the adjacent level are shown in Table 4. Surprisingly, in the US subgroup, CDA was significantly superior to ACDF in all of these outcome measures. However, in the non-US subgroup, none of these outcome measures differed significantly between CDA and ACDF. Second, we performed subgroup analysis based on the number of surgical levels. The combined results for overall success, neurological success, NDI success, radiological success, total reported AE, serious AE, device/surgery-related AE, secondary surgery at the index level and at the adjacent level, and patients' satisfaction rate are shown in Table 5. CDA resulted in significantly fewer device/surgery-related AE in the two-level CDD group, with no statistical difference in single-level CDD. In contrast, patients' satisfaction favored CDA in single-level CDD (P = 0.0002), while in two-level CDD (P = 0.05), further studies are needed to establish superiority. The remaining outcome measures were similar for single-level and two-level CDD.

Assessment of Publication Bias
Funnel plots were used to detect publication bias. For neurological success (Fig. 9A), the funnel plot appeared symmetric and all studies fell inside the funnel, indicating that no publication bias existed. However, for secondary surgery at the adjacent level (Fig. 9B), the funnel plot appeared symmetric but 1 study fell outside the funnel, indicating that publication bias existed.

Discussion
Up to now, CDA application in spinal practice has remained controversial. Whether CDA is superior to ACDF in the long run has not been established. ASD is always associated with the length of follow up. Therefore, it is crucial to evaluate the safety and efficiency of CDA in the long run. To our knowledge, there have been several meta-analyses comparing CDA with ACDF. Most of them have included partial long-term results, but these were mixed with short-term and mid-term results 5,[16][17][18][19][20][21][22][23]. Therefore, given the availability of newly published long-term results [28][29][30][31][32][33][34][35][36][37][38], we performed this study. This is the first study comparing the safety and efficiency of CDA with ACDF that focuses only on long-term follow-up.
In our meta-analysis, 11 RCT with more than 5 years' follow-up were identified. Based on the quality assessment criteria recommended by the Cochrane Back Review Group 25, all the studies were rated as low risk of bias. However, blinding of patients and care providers was not appropriately achieved in any study. In addition, only 4 studies 29,31,32,35 achieved blinding of outcome evaluators. This may result in reporting bias. Heterogeneity definitely existed in the included studies. First, different types of CDA devices were used in the 11 RCT, including Kineflex|C 29, Bryan 30,34, Discover 35, Secure-C 38, Prestige 28,31, Mobi-C 32,37, ProDisc-C 33, and PCM 36, differing in design and biomechanical properties. Second, the surgical level was different among studies. A total of 8 studies compared one-level CDD [28][29][30][32][33][34]36,38, 1 study compared two-level CDD 31, and 2 studies compared both one-level and two-level CDD 35,37. Third, the region of the studies was also different. Eight studies 28,29,31,33,34,[36][37][38] were conducted in the US and just 3 studies 30,32,35 were conducted outside the US. Fourth, the evaluation criteria of outcome measures varied among studies. Thus, we performed a sensitivity analysis that included comparing two different effects models, using the leave-one-out method 27, and subgroup analysis to find the origin of heterogeneity. The combined results of radiological superior and inferior ASD were not stable and reliable and should be considered with caution. One possible reason is that only 2 studies reported this outcome 36,37. Although no publication bias existed in neurological success, publication bias existed in secondary surgery at the adjacent level.
Fig. 9 The funnel plot of neurological success (A) and secondary surgery at the adjacent level (B). CI, confidence interval.
After 5 years' follow up or more, our study revealed that CDA achieved a significantly higher rate of clinical success and significantly better functional outcome measurements, except for NDI score. A mid-term to long-term meta-analysis conducted by Hu et al. 17 compared 4-7 years' clinical results, pooling data from 8 RCT, and showed that CDA achieved a significantly higher clinical success rate and better functional outcome. Similarly, Gao et al. 5 compared 2-5 years' clinical results, pooling data from 14 RCT, and found that CDA was superior in VAS pain scores and neurological success, but NDI scores remained similar. In addition, the major functional outcome measurements of CDA showed no obvious benefit when 1-2 years' data were pooled 24. This difference may originate from the different follow-up durations. Theoretically, CDA shares the same steps of discectomy, endplate preparation, and decompression with ACDF, so VAS arm pain should be similar. However, the VAS arm pain score favored CDA at the final follow up.
Adverse events are another major concern when applying CDA. Our results showed no statistical difference in total reported AE, serious AE, and device or surgery-related serious AE. This finding is consistent with some previous meta-analyses 5,18,23 but contrary to others 17. This difference can be explained by the different inclusion criteria of each study. Our study focused on long-term data and only enrolled RCT with more than 5 years' follow-up. Undeniably, pseudoarthrosis would not occur after CDA, but heterotopic ossification and bone loss became new problems 12,14. A recent systematic review 14 showed that the long-term heterotopic ossification rate after CDA was 53.6% and the severe (grade 3 and 4) heterotopic ossification rate was 47.5%. In addition, the severe heterotopic ossification rate was significantly associated with follow-up time, increasing by 0.63% per month 14. Bone loss was as high as 60.4%, although it did not affect mid-term to long-term clinical outcomes 12. This might be the reason why surgeons did not feel confident recommending CDA as a standard option 30. Moreover, it could explain the similar incidence of AE between CDA and ACDF.
Adjacent segment degeneration is the most important factor to be considered. The initial purpose of designing CDA was to prevent ASD after surgery. The biomechanical advantages have been well established 3,49. A recent meta-analysis showed that there was no statistically significant difference in ASD between CDA and ACDF within a 24-month follow-up period, but ASD became significantly lower in CDA as follow-up duration increased 16. In contrast, Xu et al. 21 and Zhu et al. 23 found that CDA was superior in reducing the incidence of ASD when compared with ACDF, and this superiority became more apparent over time 21. Although these 3 studies 16,21,23 attempted to evaluate ASD and symptomatic ASD separately, the follow-up periods were not clearly separated, and the long-term results were weak. Our results show that CDA has significantly lower symptomatic ASD. However, when we pooled all data together, there was no statistical difference in radiological superior ASD between CDA and ACDF. Interestingly, Ren et al. 20 found that ASD was not significantly different between CDA and ACDF with a smaller sample. Nunley et al. 50 (2018) summarized biomechanical and clinical evidence from worldwide application of CDA and concluded that CDA decreased the rate of radiographic adjacent segment pathology by alleviating adjacent-level stress. However, the reason why subgroup analysis showed no significant difference in the non-US group is still difficult to explain.
Increased attention has been focused on the secondary surgery rate. Ghobrial et al. 40 found that fewer patients with the Bryan disc required surgery for symptomatic ASD when compared with ACDF at 10 years' follow-up, although the difference was not statistically significant. However, when they performed a combined analysis of Bryan and Prestige artificial discs, they found significant differences in symptomatic ASD requiring surgery as early as after 7 years 40. Surprisingly, MacDowall et al. 51 conducted a retrospective study based on a Swedish database and found that CDA had a similar secondary surgery rate at the adjacent level but a significantly higher secondary surgery rate at the index level. However, based on our long-term results, CDA had a significantly lower rate of total secondary surgery, secondary surgery at the adjacent level, and secondary surgery at the index level, which is consistent with mid-term to long-term results 17. However, this finding is contrary to the short-term to mid-term result reported by Zhang et al. 52 that the secondary surgery rate at the adjacent level showed no significant difference. It seems that CDA exhibited superiority in reducing secondary surgery at long-term follow-up by restoring favorable physiological biomechanical properties. However, it is important to note that our subgroup analysis also showed no statistical difference in the secondary surgery rate in the non-US group.
This study has several limitations. First, due to our focus on long-term results, only 11 RCT were included and 8 of them were conducted in the USA. Therefore, our study may not reflect worldwide results and may be biased. In addition, larger sample sizes are needed in future studies. Second, although all included studies were rated as low risk of bias based on the Cochrane Back Review Group criteria, all of them failed to achieve sufficient blinding, and allocation concealment was rarely clearly described. Third, high heterogeneity exists in NDI score, radiological superior ASD, and inferior ASD. Our sensitivity analysis results revealed that radiological superior ASD and inferior ASD were not stable and, therefore, should be considered with caution. Finally, subgroup analysis showed different results for NDI score, symptomatic ASD, total secondary surgery, secondary surgery at the index level, and secondary surgery at the adjacent level between US and non-US regions. Therefore, well-designed worldwide multicenter RCT with long-term follow-up are still needed for further evaluation in the future.

Conclusion
Our study provided further evidence that CDA is superior in achieving long-term clinical outcomes such as overall success, NDI success and neurological success, VAS neck pain and arm pain, SF-36 PCS and MCS, symptomatic ASD, total secondary surgery, and secondary surgery at the index level and at the adjacent level. However, no clear benefit could be identified in regard to NDI score, total reported AE, serious AE, device/surgery-related AE, and radiological superior and inferior ASD. Well-designed worldwide RCT with long-term follow up are still necessary for further evaluation in the future.