Progression‐free survival assessed per immune‐related or conventional response criteria, which is the better surrogate endpoint for overall survival in trials of immune‐checkpoint inhibitors in lung cancer: A systematic review and meta‐analysis

Abstract
Progression‐free survival (PFS) has been used as a surrogate endpoint for overall survival (OS) in lung cancer trials. The pattern of response to immune‐checkpoint inhibitors (ICIs) differs from that to conventional chemotherapy, so immune‐related response evaluation criteria were proposed. This study aims to determine which PFS measure, PFS assessed per immune‐related response evaluation criteria (iPFS) or per conventional criteria (cPFS), is the better surrogate endpoint for OS in trials of ICIs in lung cancer. From PubMed, Embase, and The Cochrane Library, we selected clinical trials in lung cancer that administered ICIs to at least one arm and reported both median OS and median PFS. At trial level, we used weighted linear regression to compare the correlation between treatment effect (hazard ratio) on OS and on cPFS or iPFS, and the correlation between median OS and median cPFS or iPFS. We analyzed 78 ICI arms (13,438 patients) from 54 studies, including 66 arms with cPFS, seven arms with iPFS, and five arms with both kinds of PFS. We demonstrated an excellent correlation between treatment effect (hazard ratio) on OS and iPFS (R²_WLS = 0.91), while the correlation was moderate for cPFS (R²_WLS = 0.38). Similarly, the correlation between median OS and median iPFS was strong (R²_WLS ranging from 0.86 to 0.96) across different phases of trials and different types of lung cancer, ICI, and treatment modalities, while it was much weaker for median cPFS (R²_WLS ranging from 0.28 to 0.88). In conclusion, iPFS provides better trial‐level surrogacy for OS than cPFS in trials of ICIs in lung cancer.


| INTRODUCTION
Lung cancer ranked first worldwide in both incidence and mortality among all malignancies in 2018. 1 For advanced or recurrent lung cancer, the prognosis is still poor. Cytotoxic drugs have limited effect on advanced or recurrent lung cancer. Over the past few years, immune-checkpoint inhibitors (ICIs), including anti-PD-1, anti-PD-L1, and anti-CTLA-4 antibodies, have shown favorable efficacy in both advanced non-small cell lung cancer (NSCLC) and extensive-stage small cell lung cancer (SCLC). 2,3 More trials investigating ICIs in advanced lung cancer are ongoing.
Overall survival (OS) is the gold standard in the evaluation of efficacy in oncology clinical trials. Although the measurement of OS is simple and reliable, the treatment effect on OS can be diluted by cross-over, successive lines of therapy after progression, and non-cancer-related death; larger samples are therefore usually required to detect OS differences across treatment arms in clinical trials. Moreover, evaluation of OS usually requires long follow-up. Thus, given the rapid development of and urgent demand for novel immunotherapies, appropriate surrogate endpoints such as progression-free survival (PFS) are expected to allow clinical benefit to be assessed over a shorter period, thereby accelerating the development and introduction of new regimens and drugs into real-world clinical practice. PFS has been used as a surrogate endpoint in trials in lung cancer at both trial level and individual-patient-data level. [4][5][6][7][8]
Assessment of treatment effect on PFS is based on the determination of response or progression. However, criteria for evaluation of response vary greatly. The World Health Organization (WHO) criteria published in 1979 assess the patient as showing complete response, partial response, stable disease, or progressive disease according to two dimensions, namely, changes in size and number of lesions. 9 The Response Evaluation Criteria in Solid Tumors (RECIST) specifications, published in 2000, instead presented measures along a single dimension and refined some other details. 10 In 2009, Response Evaluation Criteria in Solid Tumors version 1.1 (RECIST v1.1) further updated the assessment of tumor burden and lymph nodes, and the confirmation of response based on new clinical evidence. 11 These are the most frequently used conventional response evaluation criteria in chemotherapy trials. However, the pattern of response to immunotherapy differs from that of response to conventional chemotherapy.
Immunotherapy usually requires a longer lag time before a response appears. 12 Meanwhile, some patients receiving ICIs might experience enlargement of preexisting lesions or the appearance of new lesions during the initial phase of treatment due to transient immune cell infiltration and accumulation of cancer cell debris, which is known as pseudoprogression. 13 The response rate of immunotherapies will be underestimated if assessed per conventional criteria. 14 Thus, new response evaluation criteria designed for immunotherapies are warranted to capture actual progression and identify real efficacy in patients receiving ICIs. In 2009, the immune-related response criteria (irRC) were proposed based on the WHO criteria. 12 The key point of irRC is 'wait-and-see'. Considering the phenomena of delayed response and pseudoprogression in ICI therapies, immunotherapy does not cease right after progression is first assessed per conventional criteria, and progressive disease must be confirmed with a repeated scan at least 4 weeks later. In 2014 and 2017 respectively, two new sets of immune-related response evaluation criteria, immune-related Response Evaluation Criteria in Solid Tumors (irRECIST) and iRECIST, were published. 15,16 Although more details have been refined in these new criteria, the key idea of wait-and-see has not changed.
These immune-related response evaluation criteria were designed based on the atypical pattern of response in patients receiving ICIs, but whether they perform better than conventional response evaluation criteria in assessment of efficacy or clinical benefit in trials of ICIs in lung cancer has not previously been validated at trial level. To assess the survival benefit of an intervention based on the treatment effect on surrogate endpoints, there should be a strong and robust correlation between surrogate endpoints and OS. Thus, this systematic review and meta-analysis compares the trial-level correlation between OS and PFS assessed per conventional or immune-related response evaluation criteria to determine which is the better surrogate endpoint for OS in trials of ICIs in lung cancer.

| Eligibility criteria
The eligible studies met the following PICOS (participants, interventions, comparisons, outcomes, and study design) criteria. (1) The participants were patients with primary lung cancer (including NSCLC and SCLC). (2) At least one arm of the trial was treated with regimens including ICIs such as CTLA-4 inhibitors and PD-1/PD-L1 inhibitors. (3) The comparisons were not restricted. (4) At least one ICI arm of the study reported both median PFS and median OS. (5) The study type was limited to prospective clinical trials. The language was not restricted.

| Search strategy and study selection
We searched PubMed, EMBASE, and The Cochrane Library for all eligible clinical trials from inception to 4 July 2020. Search terms included 'lung cancer', 'nivolumab', 'cemiplimab', 'avelumab', 'atezolizumab', 'durvalumab', 'PD-1', 'PD-L1', and 'CTLA-4'. Duplicate publications were excluded. For multiple publications or results from a single trial in the same patient population, only the latest publication or result was included. Pooled analyses from more than one trial were also excluded. Two investigators (Guang-Li Zhu and Kai-Bin Yang) screened the titles and abstracts for potentially eligible studies and then screened the full text of these studies to select fully eligible studies independently. Conference abstracts providing sufficient information were also included. Disagreements between investigators were resolved by consensus or referring to a third investigator (Liang Peng).

| Data extraction
Two independent investigators (Guang-Li Zhu and Kai-Bin Yang) extracted the following data from eligible studies: clinical trial registration number, any other name of the trial, phase of clinical trial, type of lung cancer, stage of lung cancer, enrollment period, median follow-up time, number of arms, intervention in each arm, dose of ICIs, intention-to-treat sample size of each arm, hazard ratios (HR) for PFS or OS, median OS, median PFS, and criteria for evaluation of response. When a clinical trial registration number was available, missing information was retrieved from registers such as ClinicalTrials.gov.
Disagreements between the two investigators were resolved by consensus or referring to a third investigator (Liang Peng).

| Outcome of interest
Clinical outcomes analyzed were OS and PFS. OS was defined as the time from randomization or initiation of treatment until death from any cause. PFS was defined as the time from randomization or initiation of treatment to first progression (locoregional or distant) or death from any cause. According to different response evaluation criteria, the PFS could be denoted as cPFS (assessed per conventional response evaluation criteria) or iPFS (assessed per immune-related response evaluation criteria). For each comparison between an ICI arm and another arm, the HR for OS and the HR for cPFS/iPFS were paired. For each arm investigating ICIs, the median OS and median cPFS/iPFS were paired.

| Statistical analysis
We performed the analysis at the trial or arm level, without individual patient-level data incorporated. Analysis of trial-level correlation between OS and PFS included only the treatment arms investigating ICIs. We applied a weighted linear regression model to quantify the trial-level correlation between the HR of OS and the HR of iPFS/cPFS after logarithmic transformation. Missing HRs of OS and iPFS/cPFS were not imputed. Points were weighted by the intention-to-treat sample size. We also calculated the surrogate threshold effect (STE) for both criteria. The STE is the minimal treatment effect on the surrogate endpoint that predicts a nonzero effect on the true endpoint; it is obtained by intersecting the upper prediction limit curve with the horizontal line where the HR of OS = 1 (zero effect). We also applied a linear regression model weighted by sample size to quantify the trial-level correlation between median OS and median iPFS/cPFS. Furthermore, considering the heterogeneity across different phases of trials and different types of lung cancer, ICI, and treatment modalities, we performed several sensitivity analyses that stratified the treatment arms by (1) type of lung cancer (SCLC, NSCLC); (2) phase of clinical trials (phase 1 or 1b trials, phase 2 trials); (3) types of ICI (anti-PD1 or PD-L1, anti-CTLA-4, dual ICI); (4) treatment modalities (ICI alone, ICI + chemotherapy). Only groups including more than three studies were included in the stratified analysis. We calculated the weighted coefficient of determination of linear regression (R²_WLS) to quantify the variation of OS explained by the iPFS/cPFS. We assessed the strength of correlation as excellent (
Meanwhile, to ensure the robustness of regression and correlation, we applied leave-one-out cross validation to each weighted linear regression model and calculated the R² of leave-one-out cross validation (R²_LOO), root mean squared error, and mean absolute error.
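The weighted regression described above can be illustrated with a minimal sketch. It is written in Python purely for illustration; the function name `weighted_linear_fit` and every data value are hypothetical and are not taken from the included trials. The sketch regresses the log HR of OS on the log HR of PFS, weights each point by intention-to-treat sample size, and returns a weighted coefficient of determination computed around the weighted mean.

```python
import math

def weighted_linear_fit(x, y, w):
    """Weighted least squares fit of y = a + b*x.

    Returns (intercept, slope, r2), where r2 is the weighted coefficient
    of determination computed around the weighted mean of y.
    """
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum(wi * (yi - intercept - slope * xi) ** 2
                 for wi, xi, yi in zip(w, x, y))
    ss_tot = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
    return intercept, slope, 1.0 - ss_res / ss_tot

# Hypothetical comparisons: HR of PFS, HR of OS, intention-to-treat size.
hr_pfs = [0.60, 0.75, 0.90, 1.05, 0.82]
hr_os = [0.68, 0.80, 0.93, 1.02, 0.85]
n_itt = [120, 300, 450, 200, 610]

# The regression is run on the log scale, as in the text.
log_pfs = [math.log(h) for h in hr_pfs]
log_os = [math.log(h) for h in hr_os]
intercept, slope, r2_wls = weighted_linear_fit(log_pfs, log_os, n_itt)
print(f"slope={slope:.3f} intercept={intercept:.3f} R2_WLS={r2_wls:.3f}")
```

The same function applies unchanged to the arm-level analysis if median PFS and median OS are passed in place of the log hazard ratios.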
Finally, the possibility of publication bias was assessed by visual estimate of the funnel plot and Egger's test when at least 10 trials were pooled.
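As a rough illustration of Egger's test, the sketch below regresses the standardized effect (log HR divided by its standard error) on precision (the reciprocal of the standard error) and reports the intercept with its t statistic; an intercept far from zero suggests funnel-plot asymmetry. The function name `egger_intercept` and all input values are hypothetical. The p-value is deliberately left out because it requires a t distribution, which the Python standard library does not provide; a statistics package would supply it from the returned t statistic and degrees of freedom.

```python
import math

def egger_intercept(effects, std_errors):
    """Egger's regression: (effect / SE) regressed on (1 / SE) by ordinary
    least squares. Returns (intercept, se_intercept, t_stat, df)."""
    y = [e / s for e, s in zip(effects, std_errors)]
    x = [1.0 / s for s in std_errors]
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance with n - 2 degrees of freedom.
    s2 = sum((yi - intercept - slope * xi) ** 2
             for xi, yi in zip(x, y)) / (n - 2)
    se_intercept = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / sxx))
    return intercept, se_intercept, intercept / se_intercept, n - 2

# Hypothetical log hazard ratios and standard errors for 10 trials.
log_hr = [-0.20, -0.35, -0.10, -0.28, -0.15, -0.40, -0.22, -0.05, -0.30, -0.18]
se = [0.10, 0.20, 0.08, 0.15, 0.12, 0.25, 0.11, 0.07, 0.18, 0.09]

b0, se_b0, t_stat, df = egger_intercept(log_hr, se)
print(f"Egger intercept={b0:.2f} (SE {se_b0:.2f}), t={t_stat:.2f}, df={df}")
```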
All statistical analyses were performed with R software (version 3.6.2).
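The leave-one-out cross validation named in the statistical analysis can be sketched as follows. This is an illustrative Python version, not the R code used for the study, and it rests on assumptions: the helper names `wls_line` and `leave_one_out` are hypothetical, the data are invented, and R²_LOO is taken as the squared Pearson correlation between held-out predictions and observed values, which is one common convention and may differ from the definition used by the authors.

```python
import math

def wls_line(x, y, w):
    """Weighted least squares line; returns (intercept, slope)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def leave_one_out(x, y, w):
    """Refit the weighted line n times, each time predicting the arm
    that was left out. Returns (r2_loo, rmse, mae)."""
    x, y, w = list(x), list(y), list(w)
    n = len(x)
    preds = []
    for i in range(n):
        a, b = wls_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:], w[:i] + w[i + 1:])
        preds.append(a + b * x[i])
    errors = [p - yi for p, yi in zip(preds, y)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # Assumed definition: squared correlation of predictions and observations.
    p_bar = sum(preds) / n
    y_bar = sum(y) / n
    cov = sum((p - p_bar) * (yi - y_bar) for p, yi in zip(preds, y))
    var_p = sum((p - p_bar) ** 2 for p in preds)
    var_y = sum((yi - y_bar) ** 2 for yi in y)
    return cov * cov / (var_p * var_y), rmse, mae

# Hypothetical arm-level data: median PFS and median OS in months.
median_pfs = [2.1, 3.0, 3.8, 4.5, 5.2, 6.4]
median_os = [7.9, 9.6, 10.8, 12.5, 13.1, 15.8]
n_itt = [60, 45, 120, 80, 150, 95]

r2_loo, rmse, mae = leave_one_out(median_pfs, median_os, n_itt)
print(f"R2_LOO={r2_loo:.3f} RMSE={rmse:.3f} MAE={mae:.3f}")
```

Because every prediction comes from a model that never saw the predicted arm, a large gap between the in-sample R² and R²_LOO signals that a few arms are driving the fit.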

| Selection of studies
After excluding duplicates, we initially identified a total of 1521 records from PubMed, EMBASE, and The Cochrane Library. After screening the abstracts and titles, we excluded 1241 records, of which 1090 were not results of clinical trials, 30 did not report data on lung cancer, nine did not investigate ICIs, and 112 were not the latest of multiple publications all based upon the same trials. We conducted full-text review for the remaining 280 potentially eligible studies, among which 226 did not report both median OS and median PFS. Finally, a total of 54 eligible studies were included for analysis (Figure 1).

| Study characteristics
An overview of the included studies is presented in Table S1. A total of 78 arms from all 54 studies, including 13,438 patients, investigated regimens containing ICIs, among which four arms were evaluated per both irRC and modified WHO criteria, one arm was evaluated per irRC and RECIST v1.1, six arms were evaluated per irRC, and one arm was evaluated per irRECIST. For the remaining 66 arms, two arms were evaluated per modified WHO criteria, and 64 arms per RECIST v1.1.

| Trial-level correlation between HR of OS and HR of iPFS/cPFS
Table 2 presents the results of weighted linear regression between the HR of OS and the HR of iPFS/cPFS for all arms. After logarithmic transformation, the HR of OS had a stronger linear correlation with the HR of iPFS (R²_WLS = 0.91) than with the HR of cPFS (R²_WLS = 0.38; Figure 2). Leave-one-out cross validation also confirmed this conclusion. Although the R²_LOO of 0.77 for weighted linear regression between the HR of iPFS and the HR of OS was lower than R²_WLS, it still indicates a very good relationship. The R²_LOO and R²_WLS are the same for the weighted linear regression between the HR of cPFS and the HR of OS, indicating a robust but moderate correlation. Given the limited availability of HR of iPFS, sensitivity analysis was not performed. The STEs for the HR of iPFS and cPFS were 0.75 and 1.21, respectively; these are the maximal observed HRs of iPFS and cPFS that would still predict a significant treatment effect on OS.
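How an STE is read off a fitted regression can be sketched numerically. The Python sketch below fits the weighted line on the log scale, builds an upper prediction limit, and searches by bisection for the HR of PFS at which that limit crosses log HR of OS = 0. Several choices here are assumptions made for illustration and are not confirmed by the source: a normal quantile stands in for the t quantile, the future trial is given the mean weight of the observed trials, and all function names and data values are hypothetical.

```python
import math
from statistics import NormalDist

def fit(x, y, w):
    """Weighted least squares; returns the pieces needed for prediction."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    s2 = sum(wi * (yi - intercept - slope * xi) ** 2
             for wi, xi, yi in zip(w, x, y)) / (len(x) - 2)
    return intercept, slope, s2, x_bar, sxx, sw

def upper_prediction_limit(x0, model, w0, level=0.95):
    """Upper limit for log HR of OS in a new trial of weight w0 at x0."""
    intercept, slope, s2, x_bar, sxx, sw = model
    z = NormalDist().inv_cdf(0.5 + level / 2)  # assumption: normal, not t
    se = math.sqrt(s2 * (1.0 / w0 + 1.0 / sw + (x0 - x_bar) ** 2 / sxx))
    return intercept + slope * x0 + z * se

def surrogate_threshold_effect(model, w0, lo=math.log(0.05), hi=math.log(2.0)):
    """Bisection for the log HR of PFS where the upper limit equals 0."""
    if upper_prediction_limit(lo, model, w0) >= 0:
        return None  # no crossing inside the search range
    if upper_prediction_limit(hi, model, w0) <= 0:
        return None
    for _ in range(100):
        mid = (lo + hi) / 2
        if upper_prediction_limit(mid, model, w0) < 0:
            lo = mid
        else:
            hi = mid
    return math.exp((lo + hi) / 2)

# Hypothetical comparisons (not data from the included trials).
hr_pfs = [0.50, 0.60, 0.70, 0.80, 0.90, 1.00, 1.10]
hr_os = [0.62, 0.70, 0.76, 0.86, 0.93, 1.00, 1.08]
n_itt = [150, 220, 300, 410, 280, 350, 190]

model = fit([math.log(h) for h in hr_pfs], [math.log(h) for h in hr_os], n_itt)
w_new = sum(n_itt) / len(n_itt)  # assumption: future trial of average size
ste = surrogate_threshold_effect(model, w_new)
print(f"STE (HR of PFS) = {ste:.3f}")
```

An observed HR of PFS below the returned value would, under these assumptions, predict an HR of OS whose upper prediction limit lies below 1.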

| Trial-level correlation between median OS and median iPFS/cPFS
For the ICI arms evaluated by conventional criteria, the mean and standard deviation of median OS and median cPFS were 12.98 ± 5.33 months and 4.30 ± 2.39 months, respectively, whereas the mean and standard deviation of median OS and median iPFS were 10.14 ± 3.44 months and 4.93 ± 1.75 months for the ICI arms evaluated by immune-related criteria. Table 3 presents the results of weighted linear regression between median OS and median iPFS/cPFS for all arms and different subgroups. The correlation between median OS and median iPFS was very good (R²_WLS = 0.88; Figure 3A), while the correlation between median OS and median cPFS was weaker (R²_WLS = 0.55) but still good (Figure 3B). Outliers are data points with a studentized residual outside the ±2 range. There are two notable outliers: studies reported by Peters et al. 35 and Goldman et al. 52 After we excluded these two studies, there was a slight change in the slope and intercept of the weighted linear regression model but an obvious increase in R²_WLS to 0.62 (Figure 3C). In the leave-one-out cross validation, the correlation between median OS and median cPFS (R²_LOO = 0.40), or even the correlation after excluding the two outlier studies (R²_LOO = 0.62), was still weaker than the correlation between median OS and median iPFS (R²_LOO = 0.81).
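The outlier rule mentioned above (studentized residual outside ±2) can be sketched for a weighted linear regression as follows. The sketch computes externally studentized residuals; whether the original analysis used the internal or the external form is not stated, so that choice is an assumption, as are the function name `studentized_residuals` and all data values.

```python
import math

def studentized_residuals(x, y, w):
    """Externally studentized residuals of a weighted linear regression."""
    n = len(x)
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    s2 = sum(wi * e * e for wi, e in zip(w, resid)) / (n - 2)
    out = []
    for xi, wi, e in zip(x, w, resid):
        # Leverage of this point in the weighted fit.
        h = wi * (1.0 / sw + (xi - x_bar) ** 2 / sxx)
        internal = e * math.sqrt(wi) / math.sqrt(s2 * (1.0 - h))
        # Convert internal to external studentization (2 fitted parameters).
        out.append(internal * math.sqrt((n - 3) / (n - 2 - internal ** 2)))
    return out

# Hypothetical arms: median PFS and median OS in months; index 4 is unusual.
median_pfs = [2.0, 2.8, 3.5, 4.2, 5.0, 5.6, 6.3, 7.1]
median_os = [8.1, 9.4, 10.9, 12.1, 19.5, 14.6, 16.2, 17.4]
n_itt = [100, 150, 200, 120, 180, 160, 140, 130]

t_resid = studentized_residuals(median_pfs, median_os, n_itt)
flagged = [i for i, t in enumerate(t_resid) if abs(t) > 2]
print("arms flagged as outliers:", flagged)
```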

| Sensitivity analyses
We first performed sensitivity analysis according to the type of lung cancer. For arms investigating SCLC, the median iPFS (R²_WLS = 0.86) showed a better correlation with median OS than did median cPFS (R²_WLS = 0.67).
Excluding the studies by Peters et al. 35 and Goldman et al. 52 improved the correlation between median OS and median cPFS in NSCLC (R²_WLS = 0.59), but it was still much weaker than the correlation between median OS and median iPFS. Leave-one-out cross validation also confirmed this conclusion. R²_LOO improved from 0.27 to 0.56 after removal of the studies by Peters et al. 35 and Goldman et al., 52 indicating an increase in robustness of correlation between median OS and median cPFS, but still did not match the excellent correlation between median OS and median iPFS (R²_LOO = 0.93).
We also performed sensitivity analysis according to the phase of clinical trials. In the trials assessed per conventional criteria, phase 3 trials (44.7%) predominated, followed by phase 2 trials (23.4%) and phase 1 or 1b trials (23.4%), whereas in the trials assessed per immune-related criteria, phase 2 trials (58.3%) predominated, followed by phase 1 trials (41.7%). For arms from phase 1 or phase 1b trials, the median iPFS showed a much stronger correlation with median OS (R²_WLS = 0.92) than did the median cPFS (R²_WLS = 0.51). After removal of the outlier study by Goldman et al., 52 the R² of weighted linear regression between cPFS and OS increased (R²_WLS = 0.76), and the R²_LOO also improved significantly to 0.66 in leave-one-out validation. Similarly, for arms from phase 2 trials, the median iPFS still showed a better correlation with median OS (R²_WLS = 0.86) than did median cPFS (R²_WLS = 0.28).
The removal of the outlier study by Peters et al. 35 mildly improved the R² of weighted linear regression to R²_WLS = 0.41 and of leave-one-out cross validation to
Finally, we performed sensitivity analyses based on the types of ICI and treatment modalities, respectively. Regarding the types of ICI, given that only one trial used an anti-PD-L1 antibody among the trials assessed per immune-related criteria, we combined anti-PD1 and anti-PD-L1 antibodies as one group. The correlation between median OS and median iPFS was very good and excellent    and anti-PD1/anti-PD-L1 group (R²_WLS = 0.57), respectively. As for the types of treatment modalities, limited by the number of trials, the stratified analysis could not be performed in the ICI alone group for iPFS or in the ICI + radiotherapy group for both iPFS and cPFS. We detected a moderate and a very good correlation between OS and cPFS in the ICI alone group (R²_WLS = 0.46) and the ICI + chemotherapy group (R²_WLS = 0.84), respectively. However, the correlation between OS and iPFS in the ICI + chemotherapy group (R²_WLS = 0.86) was stronger. In addition, the correlation between median OS and median iPFS (R²_LOO ranging from 0.65 to 0.84) was also more robust than that between median OS and median cPFS (R²_LOO ranging from 0.28 to 0.79), as demonstrated in the leave-one-out cross validation.

| Assessment of publication bias
The funnel plot is highly symmetric, and Egger's test shows no evidence of publication bias in the arms reporting HR of OS (p = 0.66) and PFS (p = 0.64) assessed per conventional response evaluation criteria ( Figure S1).
Fewer than 10 arms reported HR of OS and PFS assessed per immune-related response evaluation criteria, so their publication bias was not evaluated.

| DISCUSSION
The first ICI used to treat advanced NSCLC, nivolumab, was approved in 2015. In 2016, pembrolizumab was approved as a first-line treatment option for metastatic NSCLC. Great breakthroughs by the emerging ICIs in prolonging the survival of patients with advanced lung cancer have led to a rapidly increasing number of trials of ICIs in lung cancer. An appropriate surrogate endpoint for OS to predict clinical benefit at an early phase of trials and accelerate patients' access to new ICIs is therefore urgently needed. In this study, we aimed to compare, at trial level, the correlation between OS and PFS assessed per immune-related and conventional response evaluation criteria in lung cancer patients receiving ICIs. The trial-level correlation between the HR of OS and cPFS was moderate (R²_WLS = 0.38). The trial-level correlation between OS and cPFS was worse than in previously reported studies, which included only trials of conventional chemotherapy. [4][5][6]8 Similarly, there is a good or very good trial-level correlation between median OS and median cPFS (R²_WLS ranging from 0.51 to 0.80) for all arms and subgroups except phase 2 trials or trials using ICI alone, which exhibited only moderate correlation even after the removal of the outlier studies. However, in the leave-one-out cross validation for all arms and subgroups, we detected an R²_LOO greater than 0.5 in only three subgroups, indicating a moderate power of prediction. The atypical pattern of response and various regimens of ICIs with or without conventional cytotoxic drugs might lower the predictive value of cPFS as a surrogate marker under these less restrictive circumstances. Thus, the above evidence provides only moderate support for considering cPFS as an appropriate surrogate endpoint for OS in trials of ICIs in lung cancer.
Conversely, the trial-level correlation between the HR of OS and iPFS was excellent (R²_WLS = 0.91), which was validated by the leave-one-out cross validation (R²_LOO = 0.77). The STE was 0.75 for iPFS, indicating that an HR of iPFS lower than 0.75 would predict a statistically significant HR of OS. Moreover, the considerably higher STE of 1.21 for cPFS means that even patients with worse cPFS under immunotherapy than under control treatment (PFS HR > 1) can derive a statistically significant OS benefit from immunotherapy compared with control treatment, indicating that cPFS evaluated per conventional criteria underestimates the OS benefit of immunotherapy. We also demonstrated a very good trial-level correlation between the median OS and median iPFS for all arms (R²_WLS = 0.88), and the R²_LOO was greater than 0.60 for all arms and subgroups in the leave-one-out cross validation. The strong and robust trial-level correlation suggests that median iPFS and HR of iPFS are appropriate surrogate endpoints in trials of ICIs in lung cancer. Another meta-analysis including 14 randomized controlled trials with patients across five types of cancers reported a slightly stronger but still moderate trial-level correlation between iPFS and OS (R² = 0.277) compared with the correlation between cPFS and OS (R² = 0.260), which might be associated with heterogeneity of patterns of survival in patients across different types of cancer. 69

FIGURE 2 Weighted linear regression between treatment effect (hazard ratio) on OS and iPFS (A) and cPFS (B) after logarithmic transformation. Each circle represents a study, whose size is proportional to the intention-to-treat sample size. cPFS, progression-free survival assessed per conventional response evaluation criteria; iPFS, progression-free survival assessed per immune-related response evaluation criteria; OS, overall survival

The importance of adopting response evaluation criteria adapted to the unique pattern of response to immune-checkpoint inhibitors has not received adequate attention. Although it has been over 10 years since the publication of the irRC, RECIST v1.1 is still the most frequently used set of criteria for response evaluation in clinical trials of ICIs. The immune-related response evaluation criteria have mainly been used in early phases of trials, and no result of phase 3 trials evaluated per immune-related response evaluation criteria has yet been identified. There are several reasons why immune-related response criteria are still not widely used in ICI trials. First, prospective randomized trials comparing conventional and immune-related response criteria have yet to be conducted, so there is no confirmatory evidence of the superiority of immune-related response criteria over conventional criteria. In fact, the irRC guidelines did not claim superiority in response evaluation in ICI trials, but only recommended prospective validation of the new criteria in future trials. 12 RECIST criteria are still the most widely used and recognized criteria. Second, implementation of trials with immune-related criteria requires more precautions due to the risk of continuing the treatment after documented progression. Third, the immune-related response criteria were developed partly in response to the findings of atypical patterns of response to ICI. However, some oncologists think that the rate of pseudoprogression, which was reported to be less than 10%, 70 is insufficient to lead to a significant difference in the assessment of PFS, while others think that PFS in ICI trials would be better assessed with these new criteria.

FIGURE 3 Weighted linear regression between median OS and iPFS (A) and cPFS (B) and cPFS after removal of two outlier studies (C). Each study is represented by a circle whose size is proportional to the intention-to-treat sample size. cPFS, progression-free survival assessed per conventional response evaluation criteria; iPFS, progression-free survival assessed per immune-related response evaluation criteria; OS, overall survival
Thus, this study tries to provide an insight into the importance of immune-related response evaluation criteria in immune-checkpoint inhibitor trials, even with limited availability of studies assessed per immune-related response evaluation criteria. The strength of the evidence of this study was mainly restricted by that limited availability. The gold standard of surrogate endpoint analysis is the correlation between HR on cPFS or iPFS and HR on OS. However, the lack of phase 3 trials results in a lack of direct comparison of correlation for both types of PFS, which leads to insufficient data on HR. As a result, further analysis, including sensitivity analysis, could not be performed for the correlation between the HR of OS and iPFS. In this study, to further validate the superiority of immune-related response evaluation criteria over conventional criteria, we additionally analyzed the correlation between medians of cPFS or iPFS and medians of OS to reach more confirmatory conclusions. Another concern with the lack of phase 3 trials with iPFS in this analysis is that the correlation between OS and PFS could be confounded by trial phase. In this study, the lower correlation and poorer predictive power of cPFS could be due to the broader and more heterogeneous patient populations enrolled in phase 3 trials, while phase 1 and 2 trials are usually more stringent. However, the strong correlation between median OS and median iPFS persisted in the subgroups of different phases of trials and different types of lung cancer, ICI, and treatment modalities (R²_WLS ranging from 0.86 to 0.96), suggesting that the correlation between median OS and median iPFS might not be influenced by differences in phase of trials and types of lung cancer, ICI, and treatment modalities.
Furthermore, a surrogate endpoint for OS would be more useful in phase 3 trials than in phase 1 or 2 trials, because follow-up usually takes much longer in phase 3 trials. This research also has other limitations. First, although an appropriate surrogate endpoint should be validated at both trial level and individual-patient-data level, we did not incorporate individual patient data because such data were not available. For the design of future trials of ICIs, the adoption of immune-related response evaluation criteria should be considered, and the surrogacy of iPFS should be further validated at individual-patient-data level in future studies. Second, the update in iRECIST has overcome many disadvantages of previous criteria including irRC and irRECIST, but no data on trials assessed per iRECIST are available. Thus, whether the update in iRECIST could improve the correlation between OS and PFS compared with irRC or irRECIST requires further study.
In conclusion, this systematic review and meta-analysis demonstrates a strong trial-level correlation between treatment effect (HR) on OS and iPFS. Similarly, a strong and robust trial-level correlation between median OS and median iPFS was observed across different phases of trials and different types of lung cancer, ICI, and treatment modalities in trials of ICIs in lung cancer. These findings suggest that iPFS provides valid and robust surrogacy for OS in trials of ICIs in lung cancer. Conversely, the moderate correlation between OS and cPFS provides only modest support for adopting cPFS as a surrogate endpoint for OS in trials of ICIs in lung cancer. The conclusion should be further validated at the individual-patient-data level and in phase 3 trials.

CONFLICTS OF INTEREST
The authors declare that there is no conflict of interest.

ETHICAL STATEMENT
Ethics approval was not required for this study.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in PubMed at https://pubmed.ncbi.nlm.nih.gov/, Embase at https://www.embase.com/, and the Cochrane Library at https://www.cochranelibrary.com.