The data given in this work are the results of the AIEOP-BFM-ALL-FCM-MRD-Study Group.
How to cite this article: Dworzak MN, Gaipa G, Ratei R, Veltroni M, Schumich A, Maglia O, Karawajew L, Benetello A, Potschger U, Husak Z, Gadner H, Biondi A, Ludwig W-D, Basso G. Standardization of flow cytometric minimal residual disease evaluation in acute lymphoblastic leukemia: multicentric assessment is feasible. Cytometry Part B 2008; 74B: 331–340.
Single-laboratory experience showed that flow cytometric (FCM) assessment of minimal residual disease (MRD) in acute lymphoblastic leukemia (ALL) is feasible in most patients and gives independent prognostic information. It is, however, not known whether FCM analysis can reliably be standardized for multicentric application.
An extensive standardization program was implemented in four collaborating laboratories, which study FCM-MRD in children treated with the AIEOP-BFM-ALL 2000 protocol. This included methodological alignment, continuous quality monitoring, as well as personnel education by exchange and performance feedback.
Blinded inter-laboratory tests of list-mode data interpretation concordance (n = 202 blood and bone marrow samples from follow-up during induction of 31 randomly selected patients of a total series of n = 395) showed a very high degree of inter-rater agreement among the four centers despite differences in cytometers and software usage (intraclass correlation coefficient [ICC] 0.979 based on n = 800 single values). Lower concordance was reached with amounts of MRD below 0.1%. Concordance was also good when comparing data from sample exchange experiments (n = 42 samples; ICC 0.98) and from independent patient cohorts from the four centers (regarding positive samples per time-point of follow-up as well as risk estimates).
Treatment-response assessment in acute lymphoblastic leukemia (ALL) via quantification of minimal residual disease (MRD) has become a cornerstone for risk stratification in several current therapy protocols worldwide (1–3). Flow cytometry (FCM) is one of the methodologies most useful in this respect, because it is applicable to nearly 100% of patients and gives independent prognostic information, which overrides classical risk factors (4–7). To date, this has been shown by single-laboratory studies only. The Biomed-1 consortium has described measures for standardization in collaborative MRD studies from sample preparation and staining to acquisition on flow cytometers (8, 9). It is, however, not known to date whether list-mode data interpretation (the human factor in FCM) can reliably be standardized for MRD detection, so that multiple laboratories could concertedly provide MRD information of similar quality in large and international treatment trials. It is also not known whether data acquisition and analysis on different cytometers and with different software programs affect the results.
The collaboration of four FCM-laboratories in Berlin, Monza, Padova, and Vienna was initiated in May 2000, to make flow cytometric MRD assessment available for application in the international therapy trials of the allied study groups Associazione Italiana Ematologia Oncologia Pediatrica (AIEOP) and Berlin-Frankfurt-Münster (BFM). AIEOP-BFM ALL trials recruit ∼1000 pediatric patients with ALL per year from four European countries. This makes multicentric FCM evaluation necessary to break down work-load and ensure preserved sample quality by a short transit time from bed-side to laboratory bench. In this report, we delineate our steps of standardization and present our four-center concordance data in assessing MRD. The data show that concerted multicenter MRD-assessment by FCM is feasible and reliable.
MATERIALS AND METHODS
Standardized Flow Cytometry
The standardization process included the following steps:
1. Standardized operating protocols for sample preparation and staining as described recently (10, 11). To ensure reproducibility, commercially available products were used for erythrocyte lysis (BD FACS Lysing Solution™, Becton Dickinson [BD] Biosciences, San Jose, CA) as well as for permeabilization (Fix & Perm™, Caltag Laboratories, Hamburg, Germany).
2. Standardization of monoclonal antibodies (MoAbs) for manufacturer, clone, and partly for fluorochrome (see supplemental Table 1). All MoAbs were selected for high-quality performance after screening several similar products from different sources. Because of the differences in cytometers (Berlin and Padova: Epics XL™ and FC 500™, respectively, both from Beckman Coulter [BC], Miami, FL; Monza and Vienna: FACSCalibur™ from BD), the fluorochromes could not be kept constant for the MoAbs used in the third and fourth fluorescence channel. On BC cytometers we used the fluorochromes phycoerythrin-Texas Red (ECD) and phycoerythrin-cyanin 5 (PE-Cy5), and on BD machines phycoerythrin-cyanin 7 (PE-Cy7) and allophycocyanin (APC).
3. MoAbs were strategically assorted to fixed quadruple combinations of those markers that have proven most relevant for MRD studies in ALL (5, 8–10, 12–15). The panel of combinations was limited to a maximum of five tubes for BCP-ALL: CD20/CD10/CD19/CD34, CD58/CD10/CD19/CD34, CD10/CD34/CD19/CD45, CD10/CD11a/CD19/CD45, and CD10 ± CD20/CD38/CD19/CD34 (ordered by Channels 1–4). Three combinations were used for T-ALL: CD99/CD7/CD5/surface(s)CD3, CD99/CD7/sCD3/cytoplasmic(c)CD3, and TdT/CD7/sCD3/cCD3. Repetitive triple-marker backbones were useful for stable discrimination of similar cellular phenotypes in different tubes. To avoid influences of fluorochrome interactions on MRD detection (particularly relevant with tandem-dyes) (16), we preferred fluorescein isothiocyanate (FITC; Channel 1) or phycoerythrin (PE; Channel 2) as labels of MoAbs against the most relevant aberrations. From this panel of combinations, at least two were chosen for follow-up of individual patients according to the leukemic phenotype at diagnosis and during follow-up [thus respecting expression shifts (11) in subsequent analyses].
4. Quality control: The instrumental setup was optimized daily by analyzing Calibrite™ beads (BD) and normal adult peripheral blood (PB) cells stained with CD4/CD8/CD3/CD45 (8). The same staining was also done with each patient sample as an internal control (17, 18). DAKO Fluorospheres™ (type IIIa beads) with assigned values of molecules of equivalent soluble fluorochrome were used for intracenter longitudinal monitoring of instrument performance stability (see supplemental Fig. 1) (17, 19). Erythrocyte lysis efficiency was monitored per sample by an extra staining combination of SYTO®16 (from Molecular Probes, Leiden, The Netherlands), a live-cell-permeant nucleated-cell dye (Channel 1), together with CD10 or CD7/CD19 or CD3/CD45 (5). The time from sampling to processing for analysis was recorded for each sample for quality correlations.
5. Immunophenotyping at diagnosis and quality control evaluations were performed collecting at least 30,000 events, whereas for MRD measurements 300,000 events were acquired per MoAb combination from 750,000 stained cells. Cell acquisition was performed with the different cytometers (see above) using the EXPO™32 or RXP™ (BC), and the CELL Quest™ software (BD), respectively. Data analysis was done either with the EXPO™32, RXP™, or the PAINT-A-GATE™ software (BD). Leukemic and normal B or T cells were identified using an immunological gate (associated with 90°-scattering, SSC), which included all CD19 or CD7-positive cells. Minimal residual disease was defined as an accumulation of at least 10 clustered events displaying lymphoid-scattering properties and leukemia-associated immunophenotypic characteristics as reported (5, 10, 12, 15, 20). Skewing of the proportional MRD-quantification by irrelevant nonnucleated events was avoided by relating MRD only to SYTO®16-positive, i.e., nucleated cellular events (5). The gating and analysis strategy is exemplified in Figure 1.
6. Continued training of study group members was pursued to support the standardization process, i.e., (i) rotational personnel exchange (multiple-day site visits), (ii) twice-yearly workshops of the entire group with written summary reports, and (iii) joint list-mode data (LMD) file reviewing during workshops or online via FTP-server (access limited to group members) (21). LMD files could be exchanged between BC and BD software and analyzed without transformation requirements provided that the advanced PAINT-A-GATE™ Pro 3.0.2 software (BD) was used and that acquisition on BC cytometers had been performed with the baseline offset function “on”.
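The MRD definition in the list above (at least 10 clustered events, expressed relative to nucleated events only) can be illustrated with a minimal sketch. The function name, its parameters, and the idea of applying a single nucleated-cell fraction (as estimated from the SYTO®16 control staining) to the events of an MRD tube are assumptions made for illustration; they do not reproduce the software used in the study.

```python
from typing import Optional

MIN_CLUSTER_EVENTS = 10  # minimum number of clustered events, as stated in the text


def mrd_percent(blast_events: int,
                acquired_events: int,
                nucleated_fraction: float) -> Optional[float]:
    """Return MRD as a percentage of nucleated events.

    blast_events:       events in the leukemia-associated cluster
    acquired_events:    all events acquired for the tube
    nucleated_fraction: assumed share of SYTO 16-positive (nucleated)
                        events among acquired events, between 0 and 1
    Returns None when fewer than MIN_CLUSTER_EVENTS blasts are found,
    i.e., the sample is regarded as MRD-negative in this sketch.
    """
    if acquired_events <= 0 or not 0 < nucleated_fraction <= 1:
        raise ValueError("need acquired events and a nucleated fraction in (0, 1]")
    if blast_events < MIN_CLUSTER_EVENTS:
        return None
    nucleated_events = acquired_events * nucleated_fraction
    return 100.0 * blast_events / nucleated_events


# Hypothetical example: 30 blast events among 300,000 acquired events,
# of which half are nucleated.
print(mrd_percent(30, 300_000, 0.5))
```

Relating the blast count to nucleated events rather than to all acquired events raises the reported percentage whenever nonlysed erythrocytes or debris are present, which is the skewing the text describes.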
Patients and Samples
By January 2004, 413 patients with ALL (age 1–18 years) treated according to trial AIEOP-BFM ALL 2000 had been recruited on a per-center basis for flow cytometric MRD assessment in the four-center collaboration (Center 1: n = 110; Center 2: n = 88; Center 3: n = 61; Center 4: n = 154). PB and bone marrow (BM) samples of 395 of these patients (95.6%) were received at diagnosis and from follow-up during induction treatment: PB at Days 8, 15, 22, 33, and BM at Days 15, 33, and 78. MRD investigations were approved as part of the international trial by the institutional ethical committees and were done according to informed consent guidelines. A brief outline of the induction treatment used in the trial has been summarized recently (11). MRD-based risk assessment by FCM was done according to our previously published double-time-point algorithm (5, 22). In brief, patients with positive MRD ≥1 blast cell/μL (mostly equivalent to ≥0.01%) in BM on Day 78 were classified as having a high relapse risk, whereas patients with MRD ≤10/μL (mostly equivalent to ≤0.1%) in BM already at Day 15 (and negative thereafter on Days 33 and 78) were assigned to the low-risk group. All other patients were assigned to the intermediate-risk group. Absolute MRD values (blasts per μL) were calculated as described previously (5).
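The double-time-point rule summarized above can be written down as a short sketch. The function name and the convention of encoding an MRD-negative sample as 0.0 blasts/μL are assumptions for illustration; the published algorithm (5, 22) remains the authoritative definition.

```python
def double_time_point_risk(bm_day15: float,
                           bm_day33: float,
                           bm_day78: float) -> str:
    """Classify relapse risk from absolute BM MRD values (blasts per microliter).

    Assumption of this sketch: an MRD-negative sample is passed as 0.0.
    """
    if min(bm_day15, bm_day33, bm_day78) < 0:
        raise ValueError("MRD values must be non-negative")
    # High risk: MRD of at least 1 blast/uL in BM on Day 78.
    if bm_day78 >= 1.0:
        return "high"
    # Low risk: at most 10 blasts/uL on Day 15 and negative on Days 33 and 78.
    if bm_day15 <= 10.0 and bm_day33 == 0.0 and bm_day78 == 0.0:
        return "low"
    # Everything else is intermediate risk.
    return "intermediate"


# Hypothetical patients
print(double_time_point_risk(5.0, 0.0, 0.0))
print(double_time_point_risk(50.0, 2.0, 0.0))
print(double_time_point_risk(50.0, 2.0, 3.0))
```

The high-risk check is placed first because a Day-78 value of ≥1 blast/μL determines the group regardless of the earlier time-points.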
List-Mode Data Exchange
Thirty-one of the 395 patients with MRD data (8%) were selected for comparisons between centers. The selection of patients (n = 8 per center, except for Center 3, which incidentally submitted only seven patients) was done on the basis of randomly chosen dates of recruitment, which were the same for all centers, in two periods (early in 2002, i.e., Series 1 of n = 15 patients, and late in 2003, i.e., Series 2 of n = 16). There were 25 patients with B-cell precursor (BCP)-ALL and six with T-ALL selected. From these, a total of 202 samples from seven time-points of assessment (PB from Days 0, 8, 15, 33; BM from Days 15, 33, 78) were submitted to all centers for blinded LMD file interpretation. The four independent votes on each sample (one per center) were then collected (n = 800 submitted votes of 808 possible; 99%) and compared by the study coordinator (MND), who was not involved in the original LMD analyses. Concordance regarding qualitative (positive/negative) and quantitative MRD values was assessed. As reference values for comparisons, we used the majority vote per sample in qualitative analyses and, for quantitative concordance in MRD-positive samples, the median of the positive values. In case a majority vote was not available (8/202 samples, 4%), an extra vote of the study coordinator was used to define positive or negative. Failures were defined as (i) a negative vote in a sample otherwise qualified positive by three of the centers, or a positive vote in a sample regarded negative by the other three, and (ii) a positive MRD-level >3× larger or smaller than the median of the positive values of the sample (i.e., >half a log up or down in a log10-correlation). In addition, agreement in assessing a patient's risk stratification was also tested. For that we used the prepublished FCM-based risk algorithm (see above), as well as a newer single-time-point algorithm based only on Day-15 BM. We are currently also evaluating the latter for outcome relevance in the AIEOP-BFM ALL 2000 study: low risk if MRD < 0.1%, high risk if ≥10%, and intermediate risk otherwise.
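The reference values, the two failure criteria, and the single-time-point rule described above lend themselves to a compact sketch. All function names are hypothetical, a negative vote is encoded as 0.0, and the qualitative criterion is simplified to "disagrees with the reference status"; this is an illustration under those assumptions, not the procedure used by the study coordinator.

```python
from statistics import median
from typing import List, Optional, Tuple


def expected_result(votes: List[float],
                    coordinator_positive: Optional[bool] = None
                    ) -> Tuple[bool, Optional[float]]:
    """Return (reference status, median of the positive values).

    votes: MRD percentages from the centers, 0.0 meaning MRD-negative.
    coordinator_positive: extra vote, needed only when the votes are tied.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    positives = [v for v in votes if v > 0]
    negatives = len(votes) - len(positives)
    if len(positives) == negatives:
        if coordinator_positive is None:
            raise ValueError("tie of votes: an extra vote is required")
        is_positive = coordinator_positive
    else:
        is_positive = len(positives) > negatives
    return is_positive, (median(positives) if is_positive else None)


def is_failure(vote: float, is_positive: bool, expected: Optional[float]) -> bool:
    """Apply the qualitative and the >3-fold quantitative failure criteria."""
    if (vote > 0) != is_positive:
        return True  # false-negative or false-positive vote
    if is_positive and expected is not None:
        return vote > 3 * expected or vote < expected / 3
    return False


def single_time_point_risk(day15_bm_percent: float) -> str:
    """Day-15 BM rule: low if < 0.1%, high if >= 10%, intermediate otherwise."""
    if day15_bm_percent < 0.1:
        return "low"
    if day15_bm_percent >= 10.0:
        return "high"
    return "intermediate"


# Hypothetical sample with one discordant negative vote
status, ref = expected_result([0.5, 0.4, 0.6, 0.0])
print(status, ref, [is_failure(v, status, ref) for v in (0.5, 0.4, 0.6, 0.0)])
```

A threefold window corresponds to about half a log10 unit in either direction (log10 of 3 is roughly 0.48), which matches the wording of criterion (ii).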
Inter-laboratory concordance estimation by exchange of viable cell specimens for FCM is hampered by preanalytical factors such as the time needed for additional transport. Therefore, sample exchange was primarily done only between the two rather closely located Italian centers. At first entry, 20 patients were randomly chosen for parallel investigations. When sufficient specimen was available for a second test, follow-up samples (n = 63; BM on Days 15, 33, 78; PB on Days 8, 15, 33) were divided and sent via express mail at ambient temperature to the other center. The day after sampling (<32 h) both centers performed a blinded FCM analysis including absolute quantification of MRD (as described above).
More recently, a reagent (TransFix®, UK NEQAS, Sheffield, UK) for appropriate preanalytical stabilization of specimens became available (23). This was used with artificial serial dilution preparations of leukemic cells from nine patients (seven BCP- and two T-ALLs) admixed into normal regenerative BM (containing a median background of 0.59% CD34+ early physiologic BCP [range 4.21–0.01%] and of 6.3% total normal B-cells [12.58–1.13%]). These samples were distributed to all four centers for blinded concordance testing. The samples (n = 42; i.e., 29 positive: MRD range 7.5–0.004%; 13 negative) were stained and analyzed upon receipt with a date-of-analysis variance between centers of 1 day in median (range 1–6 days). Failures were defined as described above.
The intraclass correlation coefficient (ICC) was used to assess the inter-rater concordance of MRD-values (logarithmic scale). The ICC was calculated according to Shrout and Fleiss (Model 2, i.e., ICC(2,1)) (24). An ICC cutpoint of 0.75 discriminates between good and moderate-to-poor agreement of observed versus expected results, as suggested by Portnoy and Watkins (25). The calculations were done with SAS 9.1 software using the macro %INTRACC (http://support.sas.com/ctx/samples/index.jsp?sid=537).
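For readers without access to SAS, the Shrout and Fleiss ICC(2,1) can be computed from the mean squares of a two-way layout, ICC(2,1) = (BMS − EMS) / (BMS + (k − 1)·EMS + k·(JMS − EMS)/n), where n is the number of samples and k the number of raters. The sketch below is a plain re-implementation of that formula and not the %INTRACC macro; it assumes a complete table (every rater scored every sample), and the function name and example values are hypothetical.

```python
import math
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) of Shrout and Fleiss for an n-samples x k-raters table.

    Assumes complete data; raises ZeroDivisionError if all values are equal.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 samples and >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between samples
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols                   # residual

    bms = ss_rows / (n - 1)
    jms = ss_cols / (k - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Hypothetical MRD percentages (rows: samples, columns: four centers),
# transformed to the logarithmic scale before computing the ICC.
mrd = [[12.0, 10.5, 11.0, 13.0],
       [0.8, 1.1, 0.9, 0.7],
       [0.05, 0.04, 0.09, 0.06],
       [2.5, 2.0, 3.1, 2.7]]
log_mrd = [[math.log10(v) for v in row] for row in mrd]
print(round(icc_2_1(log_mrd), 3))
```

Because negative samples have no logarithm, a convention for zero values would be needed before applying this to a full data set; the text does not state which convention was used, so none is assumed here.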
In the series of sample exchange data, the kappa coefficient (κ) was used in addition to the proportional quantification of the positive/negative concordance to eliminate the impact of chance. Concordance of independent MRD-data from the centers and of the risk estimates was assessed with the χ² test.
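The chance correction provided by κ can be made explicit with a short sketch of Cohen's kappa for two raters, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's marginal frequencies. The function name and the example votes are hypothetical.

```python
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters scoring the same samples."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same, non-empty set of samples")
    n = len(rater_a)
    # Observed proportion of agreement
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from the marginal frequencies
    categories = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical positive/negative calls of two centers on eight split samples
center_a = ["pos", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]
center_b = ["pos", "pos", "neg", "neg", "neg", "neg", "pos", "neg"]
print(round(cohen_kappa(center_a, center_b), 2))
```

Raw percentage agreement overstates concordance when one category dominates; κ removes the share of agreement that would arise even from independent guessing with the same marginal rates.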
Qualitative Concordance of Analyses of Exchanged List-Mode Data
Of 202 submitted samples, 106 were classified as MRD-positive (53%) and 96 as negative. For Centers 1 to 4, the agreement of observed with expected votes was high: 89, 97, 93, and 96%, respectively. All four centers agreed on the MRD-status in 76% of samples overall (in MRD-positive: 78%; in negative: 73%). There was no significant difference between sample series 1 and 2. Agreement by at least three centers was found in 96% of the total sample cohort. Thus, only eight samples (including five from Day-33 BM) remained with a tie of votes. As causes of discordance in these, we found disturbance by normal lymphoid regeneration in three cases, MRD at the limits of detection in two others, and technical flaws in three samples (discordance of tubes due to carry over; poor compensation; BC baseline-offset function “off” instead of “on”).
Per time-point, agreement was best in BM samples from Day 15 (86% by four centers) and Day 78 (81%), as opposed to those from Day 33 (52%). At least three centers agreed in 100, 96, and 84%, respectively. In analyzing PB samples from Days 0, 8, 15, and 33, there was complete agreement in 100, 83, 62, and 73%, respectively. Agreement by at least three centers was at least 97% at all PB time-points.
According to leukemia phenotype, agreement by four centers was 78% in samples from BCP-ALL (130/167 specimens) and 66% in T-ALL samples (23/35). At least three centers agreed in 96 and 94%, respectively.
Quantitative Concordance of Analyses of Exchanged List-Mode Data
The overall concordance of observed versus expected MRD-values including quantitative aspects (n = 800 values cumulatively) was high (ICC 0.979). There was no relevant difference between Series 1 (0.986) and 2 (0.975). As shown in Figure 2, there was also little difference among Centers 1 to 4 regarding the agreement of their observed votes with the expected votes: ICC was 0.983, 0.993, 0.997, and 0.995, respectively. With respect to sample origin, the variance in the ability to interpret the data was also small: samples from Centers 1 to 4 rendered a coefficient of 0.987, 0.993, 0.922, and 0.997, respectively.
Of the 106 MRD-positive samples, correct MRD-levels were quoted by Centers 1–4 in 82, 93, 85, and 94%, respectively. All four centers agreed on the MRD-levels in 67% of samples, and at least three centers in 86%. Concordance was slightly better in PB samples than in BM: All four centers agreed in 72% vs. 56% of respective samples (91% vs. 75% by at least three centers). Agreement declined gradually with the level of MRD. Samples positive ≥10% (n = 27), ≥1–10% (n = 21), ≥0.1–1% (n = 33), and <0.1% (n = 25) showed agreement of all four centers in 96, 71, 64, and 36%, respectively. By at least three centers, agreement was 100, 90, 85, and 68%, respectively. Cumulatively, there were 25 false-negative estimates (6.0%) among the 420 available single values from all positive samples. Seventeen of these 25 values were from samples MRD-positive <0.1%. An additional 22 estimates (5.2%) described wrong MRD-levels. In 14 of these 22 cases, the difference did not exceed ±1 log10 of the expected value.
Among the 96 negative samples, concordantly negative votes were given by all centers in 74%, and by at least three centers in 97%. There were 24 false-positive estimates (6.3%) among the 380 available single values from all 96 negative samples. Seventeen of these 24 values were falsely interpreted positive but with low MRD (<0.1%).
Concordance of Risk Estimates upon Analyses of Exchanged List-Mode Data
As a function of the concordance of MRD-level values, we also compared the risk estimates which could be defined for 28 of the 31 analyzed patients. By Centers 1–4, the observed risk estimates matched the expected in 79, 89, 100, and 93%, respectively, based on the double-time-point risk algorithm. With regard to the single-time-point algorithm (Day-15 BM), concordance was even better (by centers: 96, 89, 100, and 89%). Among 112 single Day 15-based risk estimates, we found seven false votes (6%). Four of the latter were derived from two samples with MRD just at the border of 0.1%.
Reproducibility in Inter-Laboratory Sample Exchange
Among 63 samples exchanged between two centers, the positive/negative concordance of MRD-estimates was 90% (κ = 0.81). All six discordant samples segregated around 0.01%. As shown in Figure 3A, the reproducibility of MRD-values including quantitative aspects was high (ICC 0.97 for relative estimates; 0.96 for absolute MRD-values, data not shown). Four of the 28 samples regarded positive by both centers showed a more than threefold difference between the paired estimates.
The concordance between all four centers in the artificial dilution experiments (Fig. 3B) was also high (ICC 0.98). Of 164 MRD-values available (of 42 submitted samples), there were five false values each among 113 positive and 51 negative values (sensitivity 95.6%; specificity 90.2%), as well as one extra failure in quantitative terms in a positive sample. All four centers agreed on the MRD-status including quantitative aspects in 77% of samples (30/39), and at least three centers in 95% (39/41). Poorer agreement in several samples was due to insufficient red cell lysis after prolonged transportation, as well as to low sample volume caused by tube leakage.
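The sensitivity and specificity quoted above follow directly from the counts given in the text (five false values each among 113 positive and 51 negative single values). A brief check, with hypothetical variable names:

```python
# Counts taken from the dilution experiment described in the text
positive_values, false_negatives = 113, 5
negative_values, false_positives = 51, 5

# Sensitivity: share of truly positive values that were called positive
sensitivity = 100.0 * (positive_values - false_negatives) / positive_values
# Specificity: share of truly negative values that were called negative
specificity = 100.0 * (negative_values - false_positives) / negative_values

print(f"sensitivity {sensitivity:.1f}%, specificity {specificity:.1f}%")
# prints "sensitivity 95.6%, specificity 90.2%"
```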
Agreement of MRD Results from Independent Patient Cohorts
We tested the agreement of the four centers with respect to the available MRD-results from their locally recruited patient cohorts (Fig. 4A). In PB samples, differences in results from the four centers at various time-points were not statistically significant. In BM analyses, the proportion of MRD-positive samples varied significantly between the centers only at Day 15 (P < 0.001): Center 3, with the smallest patient cohort, reported 72% positive, as opposed to the other three centers (89–94%; overall 89%). At the other time-points, we found no statistically significant variance. Also, the proportions of patients distributed to different risk groups (Fig. 4B) did not differ significantly between centers.
In this report, we show that the implementation of several measures of standardization renders a high degree of concordance between different laboratories in assessing MRD in ALL by FCM. We based standardization on three cornerstones, i.e., personnel education, technical alignment, and continued quality control.
To ensure a homogeneous approach to data interpretation, which could otherwise be the most crucial cause of discordance despite preparative alignment, we fostered continued personnel training, performance feedback, and exchange of experience by joint analysis of LMD files. Notably, the most intensive period of staff education and alignment took place before the start of file and sample exchange, so that this article mostly describes the output of these efforts. At commencement of the study in the year 2000, data interpretation varied strongly between more and less experienced centers, in particular at regenerative follow-up time-points (e.g., Day 78: before alignment up to 32% of samples were considered MRD-positive by individual centers; 5–11% after alignment of the four centers, as shown in this manuscript). Nevertheless, the system of recursive training was continued throughout the whole study period and led to several smaller adaptations also later on (red cell lysis efficiency, see below). As delineated in this report, all the measures together led to a high degree of concordance between centers with no exception among them. Of note, usage of different cytometers, and thus of partly different MoAb-conjugates and analysis software, did not seem to have an adverse impact on agreement.
Technical standardization concerned the sample preparation, the choice of MoAbs and their combination, the acquisition process, as well as the monitoring and optimization of the instrumental setup. By using a fixed panel of MoAb combinations with broad relevance, we decreased also the need of tailoring the panel upon phenotypic characteristics of individual patients. Placement of markers and their labels among four-color combinations was designed to minimize the impact of fluorochrome interactions and to take advantage of the orientation of leukemia-associated aberrations. Strong fluorochromes (like PE) were used with antigens under-expressed on blast cells for optimal separation of leukemia and normal cells. Related to these issues, the staining setup had to be changed during the earliest study period (e.g., CD11a-PE instead of -FITC; PE-Cy7 instead of PE-Cy5 on BD-cytometers).
Several quality control measures (using beads and extra stainings) were taken to ensure stable instrument performance and to allow for in-sample gauges of staining quality and nucleated cell content. Assessing the latter ensures that only nucleated cells, and not residual erythrocytes or other irrelevant events, are used as the basis for calculating MRD proportions. It also allows assessment of the achievable sensitivity in a given sample. Theoretically, the minimal detectable amount of MRD in our setting is 10 blasts among 300,000 nucleated cells (0.0033%). Practically, this cannot be reached when the nucleated cell content among acquired events is diminished, e.g., due to nonlysed red cells. By ensuring more efficacious red cell lysis, a transit time from sampling to analysis below 32 h thus supports sensitivity (see supplemental Fig. 2). Incidentally, it is convenient to know that MRD assessment on the day after sampling yields largely similar results to immediate testing (see supplemental Fig. 3). Differences between centers regarding the efficiency of red cell lysis were seen very early during the study by LMD file exchange, which emphasized a need for further standardization. Hence, the typical limit of verifiability of our system is currently at 0.01%.
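The dependence of the achievable sensitivity on the nucleated cell content can be illustrated numerically. The sketch below assumes that the detection limit is simply the 10-event cluster threshold divided by the number of nucleated events actually acquired; the function name and the example fractions are hypothetical.

```python
def detection_limit_percent(acquired_events: int,
                            nucleated_fraction: float = 1.0,
                            min_cluster_events: int = 10) -> float:
    """Smallest detectable MRD (% of nucleated events) for one tube.

    nucleated_fraction: assumed share of nucleated events among all
    acquired events (1.0 means complete red cell lysis and no debris).
    """
    if acquired_events <= 0 or not 0 < nucleated_fraction <= 1:
        raise ValueError("need acquired events and a nucleated fraction in (0, 1]")
    return 100.0 * min_cluster_events / (acquired_events * nucleated_fraction)


# 300,000 acquired events with decreasing nucleated cell content
for fraction in (1.0, 0.5, 1 / 3):
    limit = detection_limit_percent(300_000, fraction)
    print(f"nucleated fraction {fraction:.2f}: limit {limit:.4f}%")
```

With complete lysis the limit is 0.0033%, as stated in the text; if only about one third of the acquired events are nucleated, the limit rises to about 0.01%.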
Nevertheless, we noticed several aspects for future improvement. Inter-center variance was highest at low levels of MRD, i.e., in samples MRD-positive at <0.1% and from time-points with values ranging typically between 0.1 and 0.01% (Day-33 BM and Day-15 PB). Some tube-to-tube differences and paucity of MRD events led to different interpretations. The significant difference between centers in Day-15 BM results in the independent cohort comparison was also due to divergent proportions of patients with values <0.1% (data not shown). We suppose that these shortcomings in sensitivity could be overcome by increasing the event acquisition (see Fig. 5). Hence, we are currently developing our methodology toward acquisition of 1,000,000 events per tube with a panel of fewer, but ≥6-color marker combinations. The sensitivity of the approach also determines the usefulness of MRD-thresholds and time-points for robust MRD-based risk stratification. Hence, a stratification algorithm based on thresholds ≥0.1%, like the one currently evaluated in our ongoing study based on Day-15 BM, would be particularly suitable for the methodology presented in this study (overall correct risk-estimate rate 94%). The previously published double-time-point risk algorithm including Day 78, a frequently MRD-negative time-point, showed a correct rate of 90% (5, 22). Notably, this quite adequate rate seems related to the fact that concordance in truly negative samples was found to be higher than in samples with low MRD, suggesting that specificity is a less delicate issue than sensitivity. Both these rates of reproducibility of FCM-based risk group stratification compare favorably to those reported recently for RQ-PCR in a multicenter setting (73% in repeat experiments, 81 and 86% in two-center comparisons) (26). Regarding specificity of FCM analyses, however, a background of normal lymphoid regeneration in BM may cause discordance due to misinterpretation as leukemia cells. Usually, such regeneration after BFM-type induction only occurs at Day 78 (5, 27). However, delays of treatment because of clinical complications may lead to untimely occurrence. In this respect, it is helpful to compare the scheduled and actual dates of sampling after diagnosis. In addition, application of several marker combinations allows reliable distinction of normal immature from leukemic cells in most cases (28, 29).
In conclusion, our attempts at standardization, quality control, and education seem appropriate for reaching a high degree of concordance in multicentric MRD assessment with FCM. This is particularly relevant when trying to use this methodology in internationally collaborative therapy trials, as are typical in Europe. Based on outcome correlations on a large and prospective scale, our ongoing study will establish the value of FCM-based MRD-assessment in ALL within BFM-type therapy trials, in particular in comparison to PCR-based assessment.
This study was initiated and promoted by the International BFM Study Group. The data were presented in part at the 46th Annual Meeting of the American Society of Hematology, 2004, San Diego, California. The authors thank all doctors, nurses, and technicians of the participants of the BFM and AIEOP study groups for their close collaboration in providing samples and data from their patients. The authors also thank M. Martin (Berlin), V. Leoni (Monza), L. De Zen and B. Buldini (Padova), G. Fröschl and D. Printz (Vienna) for excellent assistance in data analysis. G. Mann, A. Attarbaschi, and D. Janousek are acknowledged for clinical data supply.