An artificial intelligence system applied to recurrent cytogenetic aberrations and genetic progression scores predicts MYC rearrangements in large B‐cell lymphoma

Abstract Diffuse large B‐cell lymphoma (DLBCL), the most common type of non‐Hodgkin lymphoma, harbors MYC rearrangements (MYC R) in up to 15% of cases, and these carry an unfavorable prognosis. Because of cryptic rearrangements and variation in MYC breakpoints, MYC R may be undetectable by conventional methods in up to 10%–15% of cases. In this retrospective proof‐of‐concept study, we sought to identify recurrent cytogenetic aberrations (RCAs), generate genetic progression scores (GPS) from the RCAs, and apply both to an artificial intelligence (AI) algorithm to predict MYC status from the karyotypes of published cases. The resulting AI algorithm was then validated on our institutional cases. In addition, we analyzed the cytogenetic evolution patterns and the clinical impact of RCAs. Chromosome losses were associated with tumors lacking MYC R (MYC−), while partial gain of chromosome 1 was significant in MYC R tumors. MYC R was the sole driver alteration in MYC‐rearranged tumors, and evolution patterns revealed RCAs associated with gene expression signatures. A higher GPS value was associated with MYC R tumors. The AI algorithm (composed of RCAs + GPS) obtained a sensitivity of 91.4 and a specificity of 93.8 in predicting MYC R. Analysis of an additional 59 institutional cases with the AI algorithm showed a sensitivity of 100% and a specificity of 87%, with a positive predictive value of 92% and a negative predictive value of 100%. Cases with a MYC R showed shorter survival.

FIGURE 1 General objectives of the study
The most notable translocations involving MYC and IG loci in DLBCL are t(8;14)(q24;q32), leading to a MYC and IG heavy chain fusion (MYC-IGH); t(8;22)(q24;q11), resulting in a MYC-IGL (lambda light chain) fusion; and the less common t(2;8)(p12;q24), resulting in a MYC-IGK (kappa light chain) fusion, with frequencies of 70%, 22%, and 8%, respectively [5,6]. In a small number of cases, MYC R may involve non-IG genes [4]. In terms of clinical outcome, DLBCL with MYC R (hereinafter designated MYC+) has decreased survival compared with DLBCL carrying other chromosome aberrations or lacking a MYC R (hereinafter designated MYC−); these cases may require more aggressive therapeutic regimens than rituximab plus cyclophosphamide, doxorubicin, vincristine, and prednisolone (R-CHOP) [1,7-12]. Preliminary studies have indicated a favorable prognosis in MYC+ patients on aggressive treatment [13,14]. Therefore, establishing MYC status in these patients is essential for prognostic purposes. Because of cryptic rearrangements and variation in MYC breakpoints, both chromosome and fluorescence in situ hybridization (FISH) analysis may fail to detect these translocations in some cases [15-17].
In the case of FISH analysis, up to 10% of cases may be incorrectly identified [18-21]. Indeed, Haralambieva et al. [21] reported that 11% of MYC breakpoints may lie far from the 5′ or 3′ end of MYC itself. In a separate study, 8q24 breakpoints were mapped more than 350-645 kb 3′-downstream from MYC, inside a cluster region [22]. Consequently, current commercially available FISH probes, such as the dual color dual fusion probe set and the MYC break-apart probe, may both fail to detect MYC R. Furthermore, other genetic alterations that may convey a shared underlying biology with MYC R have been implicated, such as mutations, cryptic insertion of MYC into IGH, cryptic insertion of IG regulatory regions into MYC, deregulation of microRNA-34B, or single nucleotide polymorphisms at 8q24 [15]. In fact, Hilton et al. [23] showed that the expression signature of MYC high grade DLBCL in which MYC had either cryptic alterations or rearrangements with non-IG partners is similar to that of MYC double-hit DLBCL. Considering this, and because of the clinical impact of MYC R, we sought to develop an artificial intelligence (AI) system composed of recurrent cytogenetic aberrations (RCAs) and a derived genetic progression score (GPS) to predict MYC+ DLBCL tumors. In addition, we identified driver versus passenger alterations, evolution patterns in MYC+ tumors, and the clinical impact of RCAs on patient survival (Figure 1).

The dataset and analysis methods
The Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer (https://mitelmandatabase.isb-cgc.org, accessed on 5/20/2020) was searched for DLBCL cases reported during 1983-2019. This list was curated for cases with a break at 8q24 to identify MYC-rearranged cases (classical and nonclassical) and cases with no rearrangement at 8q24; these constituted cohort 1. Initially, karyotypes were evaluated using CytoGPS [24], a software tool that parses karyotype nomenclature, to identify RCAs. Thereafter, each case was curated manually. A two-tailed Fisher's exact test, a chi-square test, and Bonferroni-adjusted p-values were used to identify differences between the two groups. The Translational Oncology package (TRONCO) in the R environment was used to map the evolutionary trajectory of RCAs [25].
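As an illustration of the group comparison described above, the following minimal Python sketch computes a two-tailed Fisher's exact test for a single aberration and applies a Bonferroni adjustment across several tests. It is not the authors' code: the function names and the counts in the usage lines are hypothetical, and only the Python standard library is assumed.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows could be MYC+ / MYC- cases; columns could be aberration
    present / absent (hypothetical layout for illustration).
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical counts: aberration seen in 12 of 40 MYC+ and 3 of 45 MYC- cases
raw_p = fisher_exact_two_tailed(12, 28, 3, 42)
print(raw_p, bonferroni([raw_p, 0.2, 0.03]))
```

The Bonferroni step is deliberately conservative; with many RCAs tested, it controls the chance of any false positive at the cost of some power.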
The Rtreemix package was used to calculate the GPS [27]. The GPS is derived from the number of accumulated genetic aberrations and the types of aberrations in the data set. Late events that develop during tumor progression receive a higher weighted value than early events. The weighted value of each RCA is then used to calculate the GPS of each tumor. A higher score suggests a higher-grade tumor with an adverse outcome (Figure 2).
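The weighting idea can be sketched in a few lines of Python. This is a simplified illustration only, not the Rtreemix computation, which derives its scores from a fitted tree model; the weights, aberration labels, and function name below are hypothetical placeholders chosen so that a late event outweighs an early one.

```python
# Hypothetical weights: events assumed to occur later in progression
# receive larger values than events assumed to occur early.
EVENT_WEIGHTS = {
    "-13": 1.0,        # assumed early event
    "+7": 1.5,         # assumed intermediate event
    "1p36 loss": 3.0,  # assumed late event
}


def progression_score(aberrations, weights=EVENT_WEIGHTS):
    """Sum the weights of the distinct aberrations found in one tumor.

    Aberrations without a weight contribute nothing to the score.
    """
    return sum(weights.get(name, 0.0) for name in set(aberrations))


early_only = progression_score(["-13"])
with_late_event = progression_score(["-13", "1p36 loss"])
print(early_only, with_late_event)
```

Under this toy scheme a tumor accumulates score both by carrying more aberrations and by carrying later ones, which mirrors the two factors named in the text.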
The GPS for each tumor was then combined with RCAs to develop the AI algorithm. The system was composed of a neural network with 15 inputs and one output. Ten-fold cross-validation was applied, and the neural network (NNET) package was used to build the algorithm [28].
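For readers unfamiliar with k-fold cross-validation, the partitioning step can be sketched as follows. This is a generic Python illustration under stated assumptions, not the authors' R workflow: the function name, the seed, and the case count in the usage lines are hypothetical.

```python
import random


def k_fold_indices(n_cases, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Each case appears in exactly one test fold; the other k - 1 folds
    form the training set for that round.
    """
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical cohort of 50 cases split into 10 folds
for train, test in k_fold_indices(50, k=10):
    print(len(train), len(test))
```

In each round a model would be fitted on the training indices and scored on the held-out fold, and the ten scores averaged.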
The NNET was selected because of its flexibility to classify each case as MYC+ or MYC− based on a threshold value from the receiver operating characteristic (ROC) curve, as opposed to the "black box" prediction from other classifiers. ROC curve analysis was performed to evaluate the discrimination ability of the system. Seventy percent of the cases from cohort 1 were used to train the system, and the remaining 30% were used to test its ability to predict MYC status.
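One way to turn a continuous network output into a MYC+ or MYC− call is to scan candidate thresholds along the ROC curve and keep the one with the best trade-off. The Python sketch below uses Youden's J (sensitivity + specificity − 1) as the selection criterion; this criterion is an assumption, since the text does not state how the threshold was chosen, and the function names and example scores are hypothetical.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are called MYC+."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif not predicted_positive and not label:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(tp, fp, tn, fn):
    """Return (sensitivity, specificity); 0.0 when a class is absent."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def best_threshold(scores, labels):
    """Pick the observed score that maximizes Youden's J (assumed criterion)."""
    best = None
    for threshold in sorted(set(scores)):
        sens, spec = sensitivity_specificity(
            *confusion_counts(scores, labels, threshold)
        )
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, threshold, sens, spec)
    return best[1], best[2], best[3]


# Hypothetical network outputs and true MYC status (1 = MYC+, 0 = MYC-)
scores = [0.15, 0.30, 0.45, 0.55, 0.70, 0.90]
labels = [0, 0, 1, 0, 1, 1]
print(best_threshold(scores, labels))
```

Other criteria, such as fixing a minimum sensitivity, would select a different point on the same curve.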

FIGURE 2 The schematic illustrates the generation of the genetic progression score (GPS) based on the number of accumulated aberrations and the time of occurrence of the aberrations in a computed temporal oncogenic tree or trajectory pathway (i.e., late event vs. early event) [27]. A late event obtains a higher weighted value than an earlier event; for example, 1p36 loss is assigned a higher value than −13. Thus, tumors with a higher number of accumulated aberrations and late events receive a higher score.
The general workflow of the MYC prediction model.

Cohort 2 cases
Of these RCAs, gain of 1p34 and 1q14 was significantly asso-
Using these classifiers, GPS was the most important feature predictor of a MYC R.
The 59 institutional cases ( (14) difference in genetic progression score between these groups was significant (p < 0.0001)
compared to MYC− patients (44% vs. 67%; p = 0.001) [7]. Even when treated with rituximab and anthracycline-based therapies, MYC+ DLBCL maintained a poor clinical outcome [1,9]. In pediatric patients, event-free survival was sixfold lower in MYC+ cases than in MYC− cases [28]. Likewise, in germinal center (GC)-DLBCL, which carries a favorable prognosis, MYC+ status negates the positive outcome [11]. Therefore, detection of such rearrangements is of clinical importance.
analysis is warranted to further enhance our diagnostic and prognostic accuracy in the cytogenetic laboratory.

FUNDING INFORMATION
The authors received no specific funding for this work.

CONFLICT OF INTEREST
The authors declare they have no conflicts of interest.