Weakly supervised annotation‐free cancer detection and prediction of genotype in routine histopathology

Deep learning is a powerful tool in computational pathology: it can be used for tumor detection and for predicting genetic alterations based on histopathology images alone. Conventionally, tumor detection and prediction of genetic alterations are two separate workflows. Newer methods have combined them, but require complex, manually engineered computational pipelines, restricting reproducibility and robustness. To address these issues, we present a new method for simultaneous tumor detection and prediction of genetic alterations: The Slide‐Level Assessment Model (SLAM) uses a single off‐the‐shelf neural network to predict molecular alterations directly from routine pathology slides without any manual annotations, improving upon previous methods by automatically excluding normal and non‐informative tissue regions. SLAM requires only standard programming libraries and is conceptually simpler than previous approaches. We have extensively validated SLAM for clinically relevant tasks using two large multicentric cohorts of colorectal cancer patients, Darmkrebs: Chancen der Verhütung durch Screening (DACHS) from Germany and Yorkshire Cancer Research Bowel Cancer Improvement Programme (YCR‐BCIP) from the UK. We show that SLAM yields reliable slide‐level classification of tumor presence with an area under the receiver operating curve (AUROC) of 0.980 (confidence interval 0.975, 0.984; n = 2,297 tumor and n = 1,281 normal slides). In addition, SLAM can detect microsatellite instability (MSI)/mismatch repair deficiency (dMMR) or microsatellite stability/mismatch repair proficiency with an AUROC of 0.909 (0.888, 0.929; n = 2,039 patients) and BRAF mutational status with an AUROC of 0.821 (0.786, 0.852; n = 2,075 patients). The improvement with respect to previous methods was validated in a large external testing cohort in which MSI/dMMR status was detected with an AUROC of 0.900 (0.864, 0.931; n = 805 patients). 
In addition, SLAM provides human‐interpretable visualization maps, enabling the analysis of multiplexed network predictions by human experts. In summary, SLAM is a new simple and powerful method for computational pathology that could be applied to multiple disease contexts. © 2021 The Authors. The Journal of Pathology published by John Wiley & Sons, Ltd. on behalf of The Pathological Society of Great Britain and Ireland.


Introduction
Colorectal cancer (CRC) is one of the most common types of cancer and one of the top causes of cancer mortality [1]. In routine clinical workflows, CRC is diagnosed by histopathologic evaluation of H&E-stained tissue slides. In addition, all patients with metastatic or unresectable CRC are recommended to undergo testing for microsatellite instability (MSI) or mismatch repair deficiency (dMMR) and should be tested for mutations of the KRAS and BRAF genes [2]. In the UK, all patients with CRC, irrespective of tumor stage, are recommended to undergo MSI or dMMR testing [3]. Metastatic MSI/dMMR CRC is directly targetable by cancer immunotherapy, which is currently approved as a first-line therapeutic approach to this disease subtype [4]. MSI, as determined by polymerase chain reaction (PCR), and dMMR, as determined by immunohistochemistry (IHC), are used interchangeably in most clinical situations, although the results of these different tests are not always concordant [3,5]. Another type of clinically relevant genetic alteration in CRC is a mutated BRAF gene in metastatic CRC, which is directly targetable in a second-line therapeutic setting [6]. Currently, diagnosis of cancer on histopathology images and genetic testing on tumor tissue form two distinct laboratory workflows: although both are coordinated by the pathologist, they are performed using different laboratory methods. However, increasing efforts to digitize routine histopathology workflows [7,8] will potentially make digitized whole-slide images (WSI) routinely available in the future. Recent studies have shown that a wide range of molecular features, including MSI/dMMR status and BRAF mutational status, can be predicted from digitized slides of CRC using deep learning, an artificial intelligence technology [9-14].
The application of such methods is not limited to CRC but has been demonstrated in bladder cancer [15], breast cancer [16,17], sarcoma [18], head and neck cancer [19], hepatocellular carcinoma [20], and several other types of solid tumor [8,12]. Therefore, in the future, deep learning could supplement current molecular testing strategies in solid tumors and could be used as a tool for translational research [21].
Multiple different technical pipelines have been proposed to infer molecular alterations from WSI and each of them has limitations [8]. The first scientific publications in deep learning-based molecular subtyping in 2018 and 2019 applied a simple tumor annotation-based 'majority vote', i.e. they were based on a two-step process: first, they located tumor tissue in the tissue section based on manual [22] or automatic segmentations and, subsequently, the tumor tissue was processed by another neural network [10]. Further studies showed that a manual annotation-based approach could yield very high performance for tumor detection on large datasets [23]. More recent studies used deep learning to predict genotypes directly from the whole slide, including tumor and non-tumor tissue. These so-called weakly supervised approaches do not require any explicit tumor detection, applying a simple whole-slide majority vote [9,12]. Such approaches have achieved a high performance for the prediction of molecular alterations, but they sacrifice interpretability. Using the whole tissue to predict molecular features in the tumor tissue imposes predictions of molecular changes on non-tumor regions such as normal mucosa, which may not be useful or desirable. More recent studies have addressed this issue by using a new technology based on multiple-instance learning: in the context of prostate cancer detection, a weakly supervised approach yielded a clinical grade performance [24] and is currently being marketed as a commercial product [25]. Other attention-based approaches were recently applied to tumor detection in various cancer types. For example, clustering-constrained attention multiple instance learning has been proposed as a powerful methods pipeline for tumor detection and determining histopathologic subtypes [26,27]. 
However, attention-based multiple-instance learning is not widely used for predicting molecular alterations from image data, possibly because these models are complex and data-hungry [24].
A general observation is that methods pipelines in computational pathology are becoming more and more intricate: they require hand-crafted network models, loss functions, and elaborate pre-/post-processing pipelines, which cannot be easily implemented using standard programming libraries [24,28-30]. In particular, the custom architectures and data loaders required for these methods are not available out-of-the-box in popular machine learning environments, such as PyTorch, TensorFlow, Keras, or Fastai. This is in stark contrast to initial publications, which were easily reimplementable in standard programming environments in a few lines of code [10,22]. As complex workflows limit widespread reproduction and adoption, there is a need for powerful, adaptable, easily implementable, end-to-end methods for molecular testing of cancer.
Therefore, in this study, we sought to combine the ease-of-use of off-the-shelf models with one-stop-shop convenience and improved interpretability. At the same time, we strove to deliver the first application of deep learning for one-shot tumor localization and genetic subtyping in CRC. In other words, we aimed to unify the workflows of tumor diagnosis and subtyping in a single-pass neural network.

Ethics statement and patient cohorts
For this study we used anonymized H&E-stained slides of colorectal adenocarcinoma of two large cohorts.
To train the neural network we used digitized tumor-bearing tissue slides from the Darmkrebs: Chancen der Verhütung durch Screening (DACHS) study (n = 2,448 patients), a large population-based case-control and patient cohort study on CRC, including samples of patients with stages I-IV from different laboratories in southwestern Germany. We received and used exactly one tumor-bearing tissue slide per patient. For n = 1,281 of these patients, an additional non-tumor slide was available, i.e. a tissue slide extracted from the same surgical specimen but containing only normal colon mucosa, submucosa, and smooth muscle tissue. This 'normal' tissue slide was used as an additional input for the deep learning model, as explained below. Use of the DACHS tissue samples for scientific purposes was approved by the ethics committees of Heidelberg University and the medical boards of Rhineland-Palatinate and Baden-Württemberg, with the written informed consent of all participants [31]. The digitized tissue slides were provided by the Tissue Bank of the National Center for Tumor Diseases (Heidelberg, Germany) in accordance with the regulations of the tissue bank. Some tissue slides in the DACHS cohort had blue and/or black pen marks circling tumor tissue and/or normal tissue on the slide. MSI status in the DACHS cohort was investigated using a three-plex PCR panel, as described previously [32]. For external validation we applied the deep learning system on H&E-stained slides derived from the population-based Yorkshire Cancer Research Bowel Cancer Improvement Programme (YCR-BCIP) [33], comprising 889 patients who had surgical resection. dMMR or mismatch repair proficiency (pMMR) was determined with a standard four-plex IHC assay on whole slides. No pen marks were present on the slides in the YCR-BCIP dataset. The clinicopathologic characteristics of all patients are summarized in Table 1.
Glass slides in DACHS and YCR-BCIP were digitized with Leica Aperio scanners (Leica Biosystems, Wetzlar, Germany) using a 20× objective and were saved as SVS files with JPEG compression. We received and used exactly one digitized tumor slide from each patient in the DACHS and YCR-BCIP cohorts. Only patients with an available H&E slide and clinicopathologic features were used for the analysis. Some samples were excluded due to missing clinicopathologic data or missing WSI. Sample flowcharts for all experiments are provided in supplementary material, Figure S1.

Image preprocessing pipeline
Non-overlapping image tiles with a size of 512 × 512 pixels at a resolution of 0.5 μm per pixel were extracted from the WSIs. Tiles with background (more than 50% white area on the tile), blurry artifacts, and pen marks were removed during the tessellation process. To detect these tiles, we used the standard deviation of each color channel in a tile and the average of the edges detected with the Canny edge detection of the OpenCV package in Python 3.8. To remove the bias of different staining procedures, all tiles were normalized with the Macenko normalization method, using a single reference image that is publicly available at: https://raw.githubusercontent.com/jnkather/DeepHistology/master/subroutines_normalization/Ref.png [34]. After this step, tiles were used as an input for the neural network. Whenever a slide contributed more than 1,000 tiles, only 1,000 randomly chosen tiles were used. The source code for data preprocessing is available under an open-source license at: https://github.com/KatherLab/preProcessing. For all experiments, only patient-level labels were used and all tiles in the training sets were assumed to inherit the label of their parent patient. To mitigate class imbalance in the patient labels during training, tiles from the more abundant class were randomly undersampled. This means that for training neural networks, equal numbers of tiles from the positive and negative classes were used and classifiers were trained on tile-level-balanced image sets. For deployment of classifiers to the test partition in cross-validation or to the external validation set, no such class balancing procedure was applied.
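The two sampling rules described above (a cap of 1,000 random tiles per slide and random undersampling of the more abundant class) can be illustrated with a minimal Python sketch. This is not the code from the preprocessing repository: the function names, the tile identifiers, and the fixed random seed are assumptions made for illustration only.

```python
import random

MAX_TILES_PER_SLIDE = 1000  # cap stated in the text


def cap_tiles(tiles, rng, max_tiles=MAX_TILES_PER_SLIDE):
    """Return at most max_tiles randomly chosen tiles from one slide."""
    tiles = list(tiles)
    if len(tiles) <= max_tiles:
        return tiles
    return rng.sample(tiles, max_tiles)


def balance_by_undersampling(tiles_by_class, rng):
    """Randomly undersample every class to the size of the smallest class."""
    n_min = min(len(tiles) for tiles in tiles_by_class.values())
    return {
        label: rng.sample(list(tiles), n_min)
        for label, tiles in tiles_by_class.items()
    }


# Hypothetical example data: tile identifiers are placeholders.
rng = random.Random(0)  # fixed seed, chosen here only for reproducibility
slide_tiles = [f"slide1_tile{i}" for i in range(2500)]
capped = cap_tiles(slide_tiles, rng)

training_tiles = {
    "MSI": [f"msi_{i}" for i in range(300)],
    "MSS": [f"mss_{i}" for i in range(1700)],
}
balanced = balance_by_undersampling(training_tiles, rng)
print(len(capped), {label: len(t) for label, t in balanced.items()})
```

In this sketch balancing is applied to the training tiles only, in line with the statement that no class balancing was used when classifiers were deployed to test data.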

Algorithm
Here, we propose a new method, the Slide-Level Assessment Model (SLAM). We assume that colorectal tumors can carry a feature of interest, the 'target', which is defined on the level of patients. The aim is to determine the presence of the target directly from a digitized glass slide (WSI). In the present study we explored the following targets: BRAF status (mutated or non-mutated), MSI/MMR status (MSI/dMMR or microsatellite stability [MSS]/pMMR), and grade of differentiation (high grade, comprising poorly differentiated and undifferentiated [grade 3-4] and low grade, comprising well and moderately differentiated [grade 1-2]). For all targets, only slide-level labels, not tile labels, are available. We assume that only the tumor tissue carries information related to these labels, but tumor-bearing slides usually contain some non-tumor tissue adjacent to the tumor. The state of the art (SOTA) model is to train end-to-end deep learning systems on all tiles generated from these WSIs, tumor and non-tumor [9,12]. This is potentially suboptimal as it dilutes the information of interest and assigns a prediction score for non-tumor tiles.
Although some studies solve this problem with manual annotations [23] or adding a separate network for tumor detection [10], SLAM solves this in a single step: SLAM uses an end-to-end neural network based on ShuffleNet, a lightweight off-the-shelf model [35]. The output layer has been modified to have three output classes: tumor tissue belonging to the positive class (mutated, MSI/dMMR, high grade, etc.), tumor tissue belonging to the negative class (non-mutated, MSS/pMMR, low grade, etc.), and non-tumor tissue, which is assumed to be non-informative regarding the presence of the target class (label). This procedure can be extended to an arbitrary number of target classes. We used WSI with slide-level labels for training. Based on the ground truth labels, each slide was assigned to one of these three classes: 'positive' tumor slides (containing tumor tissue in the positive class as well as some non-tumor/non-informative tissue), 'negative' tumor slides (containing tumor tissue in the negative class as well as some non-tumor/non-informative tissue), and non-tumor slides (containing only non-tumor/non-informative tissue). All image tiles generated from the slides inherited the slide-level label (positive, negative, or non-tumor/non-informative) and were used to train the network. Thus, even though the training sets were contaminated with non-tumor/non-informative tissue, the SLAM network can learn to distinguish tumor tissue from non-tumor/non-informative tissue because the non-tumor tissue is introduced as an explicit third class. When deployed to an image in the test set, each tile is assigned a probability value (tile-level soft prediction). The class with the highest probability value for each tile is used for all further steps (tile-level hard prediction). 
Thus, each tile is assigned a single prediction category by the SOTA model (positive or negative) or by SLAM (positive, negative, or non-informative). SOTA methods have also been applied to multiclass problems [9], in which the performance for each target class is obtained by a one-versus-rest procedure. Like SOTA, SLAM is able to handle such multiclass problems. For simplicity, we only refer to the (much more common) binary classification problem from now on. With N_pos being the number of tiles predicted as positive (e.g. mutated) and N_tot being the total number of tiles, the patient prediction scores (PPS) in SOTA [9] are defined as follows: PPS = N_pos/N_tot. However, it is known that N_tot is contaminated by non-informative tiles corresponding to normal tissue. Therefore, N_tot is artificially inflated if there is a relevant amount of non-tumor tissue on the slide. This is solved by SLAM, which predicts positive tumor tiles (N_pos), negative tumor tiles (N_neg), and non-tumor or non-informative tiles (N_nt). PPS in SLAM are calculated as PPS = N_pos/(N_tot − N_nt), so that the denominator contains only the tiles predicted as tumor (N_pos + N_neg). Technical details are listed in supplementary material, Table S1 and an additional description of SLAM is provided in supplementary material, Figure S2. In this study, we compared the performance of SLAM to the SOTA algorithm.
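As a worked illustration of the two patient prediction scores, the following Python sketch converts tile-level soft predictions into hard predictions and then computes PPS with and without the non-tumor class in the denominator. The class names, the example tile counts, and the choice to return None when no tumor tile is predicted are assumptions made for this sketch; they are not taken from the published implementation. The example applies both formulas to the same hypothetical counts only to show the effect of the denominator, whereas in practice SOTA and SLAM are separate models with two and three output classes, respectively.

```python
from collections import Counter


def hard_prediction(soft_prediction):
    """Tile-level hard prediction: the class with the highest probability."""
    return max(soft_prediction, key=soft_prediction.get)


def pps_whole_slide(hard_predictions):
    """PPS = N_pos / N_tot (all tiles counted in the denominator)."""
    counts = Counter(hard_predictions)
    return counts["positive"] / len(hard_predictions)


def pps_slam(hard_predictions):
    """PPS = N_pos / (N_tot - N_nt) (non-tumor tiles excluded)."""
    counts = Counter(hard_predictions)
    n_tumor = len(hard_predictions) - counts["non_tumor"]
    if n_tumor == 0:
        return None  # assumption: score left undefined without tumor tiles
    return counts["positive"] / n_tumor


# One hypothetical tile with three-class soft predictions.
tile = {"positive": 0.7, "negative": 0.2, "non_tumor": 0.1}
print(hard_prediction(tile))

# One hypothetical slide: 30 positive, 10 negative, 60 non-tumor tiles.
slide = ["positive"] * 30 + ["negative"] * 10 + ["non_tumor"] * 60
print(pps_whole_slide(slide))  # 30 / 100
print(pps_slam(slide))         # 30 / (100 - 60)
```

With these counts, excluding the 60 non-tumor tiles raises the score from 0.30 to 0.75, which illustrates how a large share of normal tissue dilutes the whole-slide score.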

Experimental design and statistics
First, we tested whether tumor slides and non-tumor slides could be distinguished with a high accuracy. Then, we trained SLAM on five binary classification tasks in a within-cohort approach by using patient-level three-fold cross-validation in the DACHS cohort. The classification targets were grade (low/high), gender (female/male), KRAS mutation (mutated/wild type), BRAF status (mutated/wild type), and MSI/MMR status (MSI/MSS or dMMR/pMMR). Gender was included as a negative control. Although MSI and dMMR are measured by different laboratory methods (PCR and IHC, respectively) and are not 100% overlapping, they are widely regarded as synonymous for clinical decision making. Therefore, here we refer to 'MSI/dMMR status'. Finally, we validated the model trained on DACHS on an external cohort, YCR-BCIP, for prediction of MSI/dMMR status. The primary statistical endpoint was the area under the receiver operating curve (AUROC) with 100-fold bootstrapped confidence intervals. This means that the confidence intervals were obtained by a procedure in which a list of PPSs was generated 100 times, AUROCs were re-calculated, and the 95% confidence interval on this distribution is given. Each time the list of prediction scores was generated, n patients were randomly chosen from the list of N patients with replacement. This procedure was performed by the Matlab function 'perfcurve', which is documented at https://www.mathworks.com/help/stats/perfcurve.html. Secondary statistical endpoints were accuracy, sensitivity, specificity, and F1 score of the SOTA model and SLAM. To generate a cut-off value for these statistics, an identical automatic procedure was applied to the patient-level prediction scores in each experiment. Using the ROC curve, the closest threshold value corresponding to a sensitivity of 80% was identified and rounded to three decimal places. Subsequently, using this threshold value, a confusion matrix and statistics were calculated. 
Because ROC curves are not continuously defined, the final sensitivity could differ from 80% (see supplementary material, Table S2).
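The statistical procedure can be sketched in plain Python as follows. This is a simplified stand-in and not a reimplementation of the Matlab 'perfcurve' function: it uses a simple percentile interval on 100 bootstrap resamples, whereas the interval type used by 'perfcurve' may differ. The function names, the synthetic scores, the seed, and the tie-breaking rule in the threshold search are all assumptions made for illustration.

```python
import random


def auroc(scores, labels):
    """AUROC as the fraction of positive/negative pairs ranked correctly
    (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=100, alpha=0.05, seed=0):
    """Percentile confidence interval from patients resampled with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both classes
            stats.append(auroc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def threshold_for_sensitivity(scores, labels, target=0.80):
    """Cut-off whose sensitivity is closest to the target, rounded to 3 decimals.
    On ties, the lowest such cut-off is kept (an assumption of this sketch)."""
    n_pos = sum(labels)
    best = None
    for t in sorted(set(scores)):
        sens = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t) / n_pos
        if best is None or abs(sens - target) < abs(best[1] - target):
            best = (t, sens)
    return round(best[0], 3), best[1]


# Synthetic patient-level prediction scores (illustrative only).
rng = random.Random(1)
labels = [1] * 50 + [0] * 150
scores = [rng.gauss(0.6, 0.15) if y == 1 else rng.gauss(0.4, 0.15) for y in labels]

point_estimate = auroc(scores, labels)
ci_low, ci_high = bootstrap_ci(scores, labels)
cutoff, sensitivity = threshold_for_sensitivity(scores, labels)
print(point_estimate, (ci_low, ci_high), cutoff, sensitivity)
```

Because only a finite set of cut-offs exists for a finite list of scores, the sensitivity achieved at the selected cut-off is generally close to, but not exactly, 80%.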

Visualization
To visualize three classes in a single visualization, we employed multiplexed heat maps using three base color vectors to achieve close to perceptually optimized color maps, as described previously [36]. Based on each tile prediction value z for MSI (z_MSI), MSS (z_MSS), and normal (z_normal), the color C was generated with three red, green, blue (RGB) color vectors c (c_MSI = [0.8,0,0], c_MSS = [1,1,0], c_normal = [0,0,1]) as follows: C = z_MSI * c_MSI + z_MSS * c_MSS + z_normal * c_normal. To generate smooth maps from sparse predictions, we interpolated between the z values on a regular two-dimensional grid.
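The color of a single tile is the weighted sum of the three base color vectors, with the tile prediction values as weights. A minimal Python sketch of this mixing step is given below; the function name and the clipping of each channel to the range [0, 1] are assumptions made for illustration, and the interpolation between tiles on the two-dimensional grid is not shown.

```python
# Base RGB color vectors as given in the text.
C_MSI = (0.8, 0.0, 0.0)     # red
C_MSS = (1.0, 1.0, 0.0)     # yellow
C_NORMAL = (0.0, 0.0, 1.0)  # blue


def mix_color(z_msi, z_mss, z_normal):
    """Return the RGB color for one tile as the prediction-weighted sum
    of the three base color vectors."""
    rgb = [
        z_msi * c_msi + z_mss * c_mss + z_normal * c_normal
        for c_msi, c_mss, c_normal in zip(C_MSI, C_MSS, C_NORMAL)
    ]
    # Clip each channel to the displayable range (assumption of this sketch).
    return tuple(min(1.0, max(0.0, channel)) for channel in rgb)


# A tile predicted purely as MSS maps to yellow, a purely normal tile to blue.
print(mix_color(0.0, 1.0, 0.0))
print(mix_color(0.0, 0.0, 1.0))
# A tile split between MSI and MSS maps to an intermediate orange tone.
print(mix_color(0.5, 0.5, 0.0))
```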
In addition, we selected the highest scoring tiles (based on tile-level soft predictions) for the highest-scoring patients (based on patient predictions) and reviewed these tiles with a pathologist to identify human-interpretable morphologic patterns of interest.

Automatic slide-level tumor detection and grading
Here we present SLAM (Figure 1A-C). First, we assessed the ability of SLAM to automatically detect tumor-bearing slides on a slide level using weak (slide-level) labels with a three-fold cross-validation approach, using WSIs containing both tumor and non-tumor tissue as well as normal tissue (non-tumor colorectal tissue) slide images without any tumor tissue. In DACHS (n = 2,448 patients, [...] Figure S4). Similarly, for inference of BRAF mutational status based on slide-level labels, SOTA achieved an AUROC of 0.782 (0.736, 0.813), which was improved to 0.821 (0.786, 0.852) by SLAM (see supplementary material, Figure S3B and Table S2). Again, accuracy, sensitivity, specificity and F1 score were also improved. Taken together, these data show that SLAM improves detection performance of molecular subtypes compared with SOTA. Importantly, we found that performance for prediction of MSI/dMMR status and BRAF mutational status particularly increased from SOTA to SLAM in the high-sensitivity region of the classification model, i.e. the upper region of the ROC curve (Figure 2A and supplementary material, Figure S3B). In addition, we evaluated whether KRAS mutational status was predictable from tissue slides. Previous studies have shown only a low predictability of KRAS status from slides by previous approaches [9], which in our experiments was reflected by a poor [...] Figure S3B and Table S2). To assess the histopathologic plausibility of the proposed approach, we manually reviewed the 25 highest predictive tiles from the 25 highest predictive patients for all targets. We found that for prediction of MSI/dMMR status (Figure 2B), SLAM identified poorly differentiated and lymphocyte-rich image tiles as being the most predictive for MSI, whereas well-differentiated tumor glands with dirty necrosis were the most predictive for MSS. In the patches representative for normal tissue, i.e. 
non-informative tissue for the prediction of MSI/dMMR status, normal colon mucosa and smooth muscle tissue were the most prevalent tissue types. Similarly, for the model trained to predict tumor grading, high-grade and low-grade image tiles represented plausible tissue patterns (Figure 2B). Again, the highest scoring normal tissue tiles showed normal colon mucosa. Together, these results show SLAM's capabilities of improving prediction of genetic alterations in tumors by automatically detecting and excluding non-tumor tissue.

Multivariable visualization improves interpretability
Previous deep learning studies in digital pathology have provided univariate prediction heatmaps to make model predictions understandable to human observers. However, SLAM by design outputs multiplexed predictions, which require multivariate visualization. To achieve this, we developed a trivariate visualization method that allowed us to display tumor detection and predicted genetic alterations in a single heat map (Figure 3A). Representative prediction maps for patients in each class are shown in Figure 3B. These heat maps assist in the detailed analysis of individual tumors as they display tumor heterogeneity at a glance. When reviewing the multiplexed visualization maps with expert observers, we found that tumor tissue in MSS/pMMR tumors and MSI/dMMR tumors (Figure 4A,B) could be localized by a human observer. In addition, trivariate visualization maps highlighted tumor heterogeneity. In true MSS/pMMR tumors, although the tumor tissue was overall visualized as 'yellow' (MSS/pMMR), the tumor invasive margin was occasionally (mis-)classified as MSI/dMMR (Figures 3B, 4A). Analysis of the underlying tissue slide revealed that these regions represented [...] Figure 1A). Correspondingly, accuracy, specificity, and F1 score were improved by SLAM compared with SOTA (supplementary material, Table S2). This demonstrates the generalizability of SLAM despite differences between the training set and the test set (e.g. a different method of determining the MSI/dMMR ground truth and the presence of pen marks in the training slides, but not in the slides in the validation set). In addition, a manual review of trivariate prediction maps showed that also in this cohort, tumor detection and subtyping was generally achieved in a spatially correct way.

Discussion
Tumor detection in digitized WSIs is a classical problem in computational pathology. A number of technical approaches to this problem were proposed even before the advent of deep learning methods [37]. Nowadays, deep learning approaches outperform hand-crafted pipelines for this problem [8]. However, in recent years a different type of problem has been increasingly addressed in computational pathology research. Beyond simple tasks, such as the detection of tumor tissue, it has been shown that deep learning is able to extract subtle visual features from histology images, making it possible to predict the presence of molecular alterations from routine pathology slides [21]. The central hypothesis to this approach is that the genotype gives rise to the phenotype, therefore genetic changes cause phenotypic changes and deep learning can infer the genotype of tumors just by observing tissue phenotype [22]. Here, we propose a simple workflow that improves prediction of genetic changes by simultaneously detecting tumor tissue in digitized pathology slides. Our approach only relies on slide-level labels, i.e. weak labels that are much cheaper and easier to generate than region-specific labels such as tile-level labels [24]. No manual tumor annotations whatsoever are required during training. We only trained on approximately 3,000 weakly labeled tissue slides, whereas previous studies have used much larger cohorts of up to 10,000 patients for training [23,24]. Although the tile-level labels in this approach are very noisy, we achieved a high performance for slide-level tumor detection (AUROC 0.980) and for molecular subtyping (AUROC 0.909 for MSI in the test cohort, 0.900 for MSI in the external validation cohort). In addition to accurate tumor detection on a slide level, our approach provides visualization maps for human readers that help in localizing tumor regions in heterogeneous tissue slides. 
These multiplexed visualization maps are generated with a new trivariate visualization method, which has previously only been applied for visualization of radiology image data [36]. This method allows expert observers to simultaneously check tumor localization capabilities and the predictions of molecular alterations of a deep learning model. We applied SLAM to multiple clinically relevant target features in CRC: MSI/dMMR status (which qualifies patients for immunotherapy [4]), BRAF mutational status (which qualifies patients for targeted therapy [6]), and grade of differentiation (an established histopathology feature defined on a case level). The cohort we used to investigate this was derived from a range of different pathology laboratories in southwest Germany, maximizing diversity of sample processing procedures. Finally, because all computational pathology methods should be validated in external cohorts in order to ensure generalizability [38], we evaluated classification performance on the YCR-BCIP cohort from 12 different institutions across the Yorkshire region of the UK. In this external cohort, we achieved a high performance with an AUROC 0.900 (0.864, 0.931) for MSI detection in addition to interpretable tumor localization. This demonstrates the robustness and generalizability of SLAM. Importantly, the idea behind SLAM is not to provide an automatic tool for perfect tumor segmentation in tissue slides, but to use detection of normal tissue as a tool to improve classification performance for the prediction of molecular alterations.
In an ecosystem of ever-increasing complexity of computational pathology workflows, the new approach provides a simple yet highly effective method for tumor localization and genotype prediction based on pathologic images. This simple method can be implemented using off-the-shelf models with transfer learning using standard deep learning libraries. Like any computational pathology method, before use in clinical routine, our method needs to undergo additional quality control and regulatory approval. A key limitation of our study was that it was only applied to a single tumor type, in which tumor tissue can be well distinguished from normal tissue. Future studies are needed to determine the performance of SLAM in other tumor types with more complex histopathologic patterns, such as pancreatic cancer or gastric cancer. We provide all of our source code under an open-source license, allowing other groups to test SLAM in other disease contexts.
Table S1. Technical details of SLAM

PL Schrammen et al