Automated curation of large‐scale cancer histopathology image datasets using deep learning

Artificial intelligence (AI) has numerous applications in pathology, supporting diagnosis and prognostication in cancer. However, most AI models are trained on highly selected data, typically one tissue slide per patient. In reality, especially for large surgical resection specimens, dozens of slides can be available for each patient. Manually sorting and labelling whole‐slide images (WSIs) is a very time‐consuming process, hindering the direct application of AI on the collected tissue samples from large cohorts. In this study we addressed this issue by developing a deep‐learning (DL)‐based method for automatic curation of large pathology datasets with several slides per patient.


Methods: We collected multiple large multicentric datasets of colorectal cancer histopathological slides from the United Kingdom (FOXTROT, N = 21,384 slides; CR07, N = 7985 slides) and Germany (DACHS, N = 3606 slides). These datasets contained multiple types of tissue slides.

Address for correspondence: Jakob Nikolas Kather, Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Fetscherstrasse 74, Dresden 01307, Germany. e-mail: jakob-nikolas.kather@alumni.dkfz.de

Lars Hilgers and Narmin Ghaffari Laleh contributed equally.

Introduction
Over the course of time, we have observed a continuous increase in the amount of digital histopathology image data that is readily available.[2][3][4] AI has been applied to numerous tasks based on information that can be extracted from histology slides, including cancer detection,[5,6] predicting the origin in cancer of unknown primary,[7] survival prediction,[8][9][10] genetic subtyping,[4,11,12] and prediction of treatment response.[13] These methods are valuable research tools, which are also being incorporated into clinical routines as diagnostic algorithms approved by regulatory entities. While a substantial number of published studies have relied on only hundreds or thousands of digitized whole-slide images (WSIs), there are currently large academic and commercial consortia that aim to expedite the digitalization and accessibility of hundreds of thousands of pathology slides.[3,14] The majority of published studies were carried out on highly selective image collections, where only one WSI is assumed to be representative of the entire patient case. In reality, in many cases the histopathological analysis is not limited to a single slide for a given patient.[15,16] For example, colorectal cancer (CRC) resection specimen cases routinely comprise over 25 slides, and this number can increase when the tumour is large, numerous lymph nodes are identified, or immunohistochemistry (IHC) is required.
[17] Although crucial, these slides are usually not labelled, and it is not routinely reported which slides contain which tissue types. As a result, dozens of unlabelled slides are usually available for a single patient. WSIs have been used for a multitude of research applications such as molecular subtyping,[12,18,19,20] survival prediction,[21,22] response prediction,[23] or to identify risk factors for lymph node metastasis,[24] but a manual selection step by an expert pathologist is usually required to select WSIs that contain the desired tissue type (tumour tissue, normal tissue, lymph node tissue, IHC, etc.) and are of good quality. Previous work has shown that "search and retrieve" approaches can be implemented by extracting visual features from high-resolution tiles generated from WSIs.[25] However, this is computationally expensive. Pathologists can often identify tissue slides without the aid of a microscope, simply by observing a glass slide with the naked eye. For instance, in CRC pathology, human experts can easily distinguish tumour slides from lymph nodes or normal mucosa just by looking at the glass slide without any magnification. While it has been shown that DL models can efficiently identify tissue characteristics, such as lymph nodes, using low-resolution images,[26] there is still a clear need for automated curation of large histopathological datasets via an algorithm that can efficiently recognize and classify different tissue types to presort large collections of WSIs for subsequent DL applications. Only with such systems available can advanced AI algorithms be deployed in a fully automatic way in routine diagnostic workflows.
Therefore, we hypothesized that DL can assist the curation of large collections of WSIs at a low resolution, using only the "thumbnail" images. We developed and validated DL-based models to classify large and unsorted collections of WSIs (in our case, CRC cases) into tissue slide categories. We externally validated the performance of the model in two additional CRC datasets to provide evidence for the generalizability of the model beyond the dataset it was initially trained on. To further investigate the performance of the model, we used explainability methods, namely gradient-weighted class activation mapping (Grad-CAM), to gain insights into the features and regions of the input images that the model relies on for its predictions. Additionally, we analysed the misclassified cases from a pathological point of view to define limitations of the model and potential areas for improvement.

ETHICS STATEMENT
This study was carried out in accordance with the Declaration of Helsinki. The collection of the tissue samples for the cohorts FOXTROT-CRC and CR07-CRC was granted by the Northern and Yorkshire Research Ethics Committee (Jarrow, UK; Unique Reference Number: 07/MRE03/24).[27,28][29-31] The overall analysis was approved by the Ethics Committee of the Medical Faculty of Technical University of Dresden (BO-EK-444102022).

PATIENT COHORTS
In this study we analysed digital WSIs from three large multicentric patient cohorts (Figure 1A-F). All WSIs were stored in the SVS format. Details and clinicopathological characteristics of all the samples are shown in Table 1. We used the "Fluoropyrimidine, Oxaliplatin, and Targeted Receptor pre-Operative Therapy for colon cancer" cohort (FOXTROT, N = 1006 patients, N = 21,384 WSIs, Figure S1)[32] as a training set and then used the "Medical Research Council CR07" (CR07, N = 608 patients, N = 7985 WSIs, Figure S2)[28] and "Darmkrebs: Chancen der Verhütung durch Screening" study (DACHS, N = 2448 patients, N = 3606 WSIs, Figure S3)[29] cohorts as external test sets. All three cohorts represent multicentre, large-scale clinical trials / cohort studies. Histopathology slides for each cohort were mostly created using routine diagnostic pipelines at each specific trial centre. For the FOXTROT cohort, 90% of slides were created locally and 10% were created centrally at St. James University Hospital in Leeds, UK. For the CR07 cohort, 74% of slides were created locally and 26% were created centrally at St. James University Hospital in Leeds, UK. For the DACHS cohort, all tissue was processed at the individual trial centres. For all cohorts, slides were scanned using Leica Aperio slide scanners. Slides for FOXTROT and CR07 were scanned at St. James University Hospital in Leeds, UK. For DACHS, slides were scanned at the Tissue Bank at the National Center for Tumour Diseases (NCT) in Heidelberg, Germany.
From each WSI, we generated a low-resolution thumbnail image at a fixed resolution of 32 micrometres per pixel using an automated script. Thumbnails were saved in the JPEG format. Low-resolution thumbnail images were then resized to 224 × 224 pixels with zero padding to preserve proportions. No other preprocessing steps were applied to the images.
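The resize-with-zero-padding step above reduces to simple geometry. The following is an illustrative sketch, not the authors' released script; the function name and return convention are our own. It computes the scaled size and padding offsets for fitting an arbitrary thumbnail into the 224 × 224 canvas while preserving the aspect ratio:

```python
def pad_to_square(width, height, target=224):
    """Compute the scaled size and zero-padding offsets needed to fit a
    thumbnail of (width, height) into a target x target square while
    preserving proportions, as described in the text."""
    scale = target / max(width, height)           # longest side maps to target
    new_w, new_h = round(width * scale), round(height * scale)
    pad_left = (target - new_w) // 2              # remaining pixels become
    pad_top = (target - new_h) // 2               # symmetric zero padding
    return (new_w, new_h), (pad_left, pad_top)
```

The returned size and offsets can then be handed to any image library to paste the resized thumbnail onto a black 224 × 224 canvas.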
We aimed to train a DL model capable of classifying colorectal WSIs into seven sample categories: (1) tumour tissue (T), (2) nontumour tissue (NT), (3) lymph node (LN), (4) biopsy and endoscopic resection (BE), (5) fat, (6) IHC, and (7) tissue microarray (TMA) (full descriptions for each category can be found in Table S1). All classes except IHC represented slides stained with haematoxylin and eosin (H&E). During annotation, we removed cases that were too heavily artefacted or did not fit into any of the defined classes. During model deployment, the class "Undecided" was assigned to a WSI if the prediction score of the classification model was below a defined confidence threshold. Due to the large size of the training cohort (FOXTROT), ground-truth labels were generated in a semisupervised way (Figure 1D). Two observers, a trainee pathologist (K.J.H.) and J.N.K., manually labelled a random subset (41%) of all FOXTROT WSIs. This subset comprised 8814 WSIs (N = 5961 T, N = 1105 NT, N = 296 LN, N = 207 BE, N = 105 fat, N = 1008 IHC, N = 132 TMA). We used this carefully annotated subset of data to train a very simple convolutional neural network (CNN). The characteristics of this model can be found in the "Implementation and parameters" section. We used this simple classifier to generate noisy labels for the rest of the training cohort (59% of FOXTROT). In this stage, all the noisy labels assigned by DL were manually checked and corrected by the two observers. As a result of this procedure, we were able to annotate all 21,384 WSIs in the training cohort (i.e. FOXTROT) more quickly and efficiently. Labels for CR07 were manually generated by a trainee pathologist (K.J.H.)
for every single image. Available categories in the CR07 cohort are T (N = 3130 WSIs), NT (N = 397 WSIs), LN (N = 2800 WSIs), BE (N = 268 WSIs), fat (N = 1036 WSIs), and IHC (N = 354 WSIs). Labels for DACHS were available in the original study database, where they had been added by pathologists of the National Center for Tumour Diseases (NCT) biobank at the Institute of Pathology of the University of Heidelberg, Germany. For DACHS, only the categories T (N = 2319 WSIs) and NT (N = 1287 WSIs) were present in the dataset. A very important aspect of the DACHS cohort is that ~82% of its WSIs contain very clear ink marks. Our purpose in selecting this cohort was to test the sensitivity of the model to possible artefacts on the WSIs.
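The deployment rule described above, assigning "Undecided" whenever the top prediction score falls below the confidence threshold, can be sketched in a few lines. This is a minimal illustration with our own function name; the class list follows Table S1:

```python
CLASSES = ["T", "NT", "LN", "BE", "fat", "IHC", "TMA"]

def assign_label(scores, classes=CLASSES, threshold=0.95):
    """Return the predicted class for one WSI, or "Undecided" when the top
    softmax score does not clear the confidence threshold (deployment rule)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return classes[best] if scores[best] >= threshold else "Undecided"
```

With the 0.5 threshold used elsewhere in the paper, the same function simply admits more cases as confidently classified.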
We employed two experimental strategies. Strategy #1: supervised learning in which the full training cohort was used (Figure 1E). Strategy #2: supervised learning in which only a very small subset of labelled data from the training cohort was used during training. For this strategy, the number of instances in the training set was limited to 2, 4, 8, 16, 32, and 64 samples per class. For both strategies, we ran two experiments. Experiment #1: an "internal classification experiment" on the FOXTROT cohort only, in which we used threefold cross-validation to assess within-cohort classification performance. Experiment #2: we retrained a classifier on the FOXTROT cohort and then externally tested its performance on CR07 and DACHS. Finally, we employed two different techniques for each experiment. Technique #1: classical transfer learning, in which a CNN model that was pretrained on the ImageNet database was retrained on the task at hand (Figure 1B). Technique #2: self-supervised pretraining, in which a pretrained CNN was first trained on FOXTROT in a self-supervised way (without labels) and later retrained in a supervised way (with labels) (Figure 1C,F). Altogether, two strategies, two experiments, and two techniques yielded eight separate experimental runs.
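The factorial design above (two strategies × two experiments × two techniques) can be made explicit; a small sketch enumerating the eight runs, with descriptive labels of our own choosing:

```python
from itertools import product

strategies = ["full training set", "few-shot subset"]
experiments = ["internal cross-validation (FOXTROT)",
               "external testing (CR07, DACHS)"]
techniques = ["ImageNet transfer learning", "self-supervised pretraining"]

# Cartesian product of the three factors gives the eight experimental runs.
runs = list(product(strategies, experiments, techniques))
```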
IMPLEMENTATION AND PARAMETERS

For the noisy-label generating model, we trained an ImageNet-pretrained ResNet-18 network for a maximum of 100 epochs with a batch size of 128, a learning rate of 10⁻⁴, and a weight decay of 10⁻⁴, as per the default settings obtained in previous research projects.[33] Early stopping was used with a minimum number of epochs of 30 and a patience value of 10.
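The early-stopping rule (minimum 30 epochs, patience 10) can be captured by a small helper. This is a sketch of the logic as described, not the authors' training code; it assumes a per-epoch validation loss is available:

```python
class EarlyStopping:
    """Stop training once the validation loss has not improved for
    `patience` consecutive epochs, but never before `min_epochs` epochs."""

    def __init__(self, min_epochs=30, patience=10):
        self.min_epochs, self.patience = min_epochs, patience
        self.best = float("inf")   # best validation loss seen so far
        self.bad_epochs = 0        # epochs since the last improvement
        self.epoch = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        self.epoch += 1
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.epoch >= self.min_epochs and self.bad_epochs >= self.patience
```

In a real training loop, `step` would be called after each validation pass, breaking out of the loop when it returns True.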
Afterwards, for supervised training, we also trained an ImageNet-pretrained ResNet-18 network, now using the fully curated labels, with the same parameters mentioned for the noisy-label generating model. For self-supervised training we used SimCLR,[34] a method for contrastive self-supervised learning (SSL). We trained this network for 500 epochs with a batch size of 256, a learning rate of 10⁻⁴, and a weight decay of 10⁻⁵. Augmentations for contrastive SSL were applied to the images, as described previously[35] (Table S2). SSL was conducted using Python's Lightly package to set up the SSL method, including image augmentations, and the PyTorch Lightning package for model training. No further hyperparameter tuning was performed.

STATISTICS AND EXPLAINABILITY
The primary endpoint was the area under the receiver operating characteristic curve (AUROC). We assumed that an AUROC above 0.90 would represent a very good classifier. Specifically, we used macro-averaged AUROCs, where the AUROC for each class is calculated separately and then averaged across all classes. The 95% confidence intervals (CIs) of the AUROC values were calculated using the quantiles obtained through 1000-fold bootstrapping with resampling. We also calculated the overall accuracy of the network with fixed classification thresholds of 0.5 and 0.95 applied to the output neurons.
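The macro-averaged AUROC and its bootstrap CI can be computed from first principles. The following is our own stdlib-only sketch of the described statistics, using the pair-counting definition of the AUROC; it is illustrative, not the authors' analysis code:

```python
import random

def auroc(labels, scores):
    """One-vs-rest AUROC via pair counting: the probability that a random
    positive is scored above a random negative (ties count 0.5). Falls back
    to 0.5 when a class is absent, e.g. in degenerate bootstrap resamples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auroc(y_true, y_score, n_classes):
    """AUROC per class (one-vs-rest), then the unweighted mean across classes."""
    per_class = [auroc([int(y == c) for y in y_true],
                       [row[c] for row in y_score]) for c in range(n_classes)]
    return sum(per_class) / n_classes

def bootstrap_ci(y_true, y_score, n_classes, n_boot=1000, seed=0):
    """95% CI from the 2.5% and 97.5% quantiles of n_boot resamples."""
    rng = random.Random(seed)
    n, stats = len(y_true), []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_auroc([y_true[i] for i in idx],
                                 [y_score[i] for i in idx], n_classes))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]
```

In practice a library routine (e.g. scikit-learn's `roc_auc_score` with macro averaging) would replace the hand-rolled AUROC; the sketch only makes the definition concrete.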

Results

DL REACHES HIGH TISSUE CLASSIFICATION PERFORMANCE BASED ON THUMBNAIL IMAGES
First, we assessed the predictability of tissue classes from thumbnail images in a multiclass classification approach. Our baseline approach, internal cross-validation on the FOXTROT cohort (N = 21,384 slides) with a standard transfer learning approach, yielded a near-perfect AUROC of 0.995 [95% CI: 0.994-0.996] (Figure 2A,F) and accuracies of 99.8% and 99.9% for the predefined classification thresholds of 0.5 and 0.95, respectively. The number of classified cases, meaning cases that were classified by the model with a prediction probability above the chosen threshold, was 21,384 (100% of all cases) for the 0.5 confidence threshold and 21,336 (99.78% of all cases) for the 0.95 confidence threshold. Thus, most cases were still confidently classified, even when the confidence threshold for classification was raised considerably (Table 2, Table S3).
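The paired numbers reported here (accuracy among confidently classified cases plus the fraction of cases classified) can be computed as follows; a minimal sketch with our own function name:

```python
def threshold_metrics(preds, labels, threshold):
    """Accuracy among confidently classified cases plus coverage.
    A case counts as classified only if its top softmax score
    clears the confidence threshold."""
    classified = [(p.index(max(p)), y) for p, y in zip(preds, labels)
                  if max(p) >= threshold]
    if not classified:
        return float("nan"), 0.0           # nothing cleared the threshold
    coverage = len(classified) / len(preds)
    accuracy = sum(top == y for top, y in classified) / len(classified)
    return accuracy, coverage
```

Raising the threshold trades coverage for accuracy, which is exactly the pattern seen between the 0.5 and 0.95 columns of Table 2.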

DL CLASSIFICATIONS GENERALIZE TO EXTERNAL COHORTS
Next, to assess how well the model generalizes, external validation was conducted on independent CRC cohorts, namely CR07 (N = 7985 slides) and DACHS (N = 3606 slides). The model reached accuracies of 91.7% (Figure 2B) and 94.6% (Figure 2C) on the CR07 cohort, and 78.0% (Figure 2D) and 85.1% (Figure 2E) on the DACHS cohort, for the predefined classification thresholds of 0.5 and 0.95, respectively. The macro-averaged AUROC was 0.982 [95% CI: 0.979-0.985] for the CR07 cohort (Figure 2F) and 0.875 [95% CI: 0.864-0.887] for the DACHS cohort (Figure 2G). Similar to the results obtained in the internal cross-validation experiment, a decrease in the number of classified cases was also observed for CR07 when applying the 0.95 confidence threshold, with 7331 (91.81%) classified cases compared to 7948 (99.54%) for the 0.5 classification threshold.
For DACHS, the number of classified cases was 2752 (85.1%) for the 0.95 classification threshold and 3560 (98.72%) for the 0.5 classification threshold.
Based on these results, the model performs well even when deployed on datasets from different institutions.Remarkably, model performance remained good even on the DACHS cohort, which, as we mentioned earlier, contains heavy ink marks on many of its slides, further showing that our models are quite robust to slide artefacts and other confounders.
Furthermore, we investigated the performance of our models when trained on a comparatively smaller dataset. For this, we trained a supervised model with small subsets of the original dataset (2, 4, 8, 16, 32, and 64 samples per class). Even when the network was given only two cases of each class for training, overall accuracy remained above 0.5, with accuracies of 58.4% (0.5 threshold; 30.22% of cases classified) and 99.4% (0.95 threshold; 0.80% of cases classified) for internal validation on the FOXTROT cohort, 58.2% (0.5 threshold; 28.98% of cases classified) for external validation on CR07, and 58.8% (0.5 threshold; 10.37% of cases classified) for external validation on DACHS (Table 3). For few-shot experiments with these very small training datasets, the model accuracy for higher thresholds, such as 0.95, fluctuated significantly, since the number of confidently classified cases (those with prediction values above the set threshold) was very small (one for CR07 and zero for DACHS in the n = 2 case). Together, these data highlight that models can be trained on relatively small amounts of data while still retaining good classification performance, even when working with low-resolution thumbnails.
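The few-shot subsets (n samples per class) described above can be drawn with a simple per-class sampler. This sketch assumes slides are given as (label, slide_id) pairs, which is our own data layout, not the authors':

```python
import random

def few_shot_subset(items, n_per_class, seed=0):
    """Draw n_per_class slides for every class, mirroring the
    2/4/8/16/32/64-shot training subsets. `items` is an iterable of
    (label, slide_id) pairs; raises ValueError if a class has too few slides."""
    rng = random.Random(seed)          # fixed seed for reproducible subsets
    by_class = {}
    for label, slide in items:
        by_class.setdefault(label, []).append(slide)
    return {label: rng.sample(slides, n_per_class)
            for label, slides in by_class.items()}
```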

SELF-SUPERVISED LEARNING
Additionally, we explored training the model in a self-supervised way with the aim of improving performance further. With the inclusion of an SSL step in our workflow, the model showed performance similar to the classic approach. Overall accuracy for testing on the complete datasets was very high, with accuracies of 99.8% (99.99% classified cases) and 99.9% (99.78% classified cases) for FOXTROT, 91.4% (99.70% classified cases) and 94.6% (91.98% classified cases) for CR07, and 79.5% (98.89% classified cases) and 85.5% (78.87% classified cases) for DACHS for the two confidence thresholds, respectively (Table 2, Table S4). Similarly, when tested on the few-shot learning task, performance was comparable to the classic approach, with an overall accuracy that remained above 0.5. In general, the SSL approach showed at least parity with the classic approach in all conducted experiments. All data can be found in Table 4.

IDENTIFICATION OF POSSIBLE REASONS FOR MISCLASSIFIED SAMPLES
In order to gain insight into the misclassified cases, we analysed the cases that were misclassified by our
classic approach model when applied to the CR07 dataset (0.5 confidence threshold; 662 cases; 8.3% misclassification rate). The most prevalent reason for misclassification (31.47% of misclassified cases) was the occurrence of key features of multiple classes within the image, which makes classification into one distinct class difficult. Most cases in this category were slides that contained mainly adipose tissue but also included small lymph nodes, and could reasonably be classified as "fat" as well as "lymph node". Other examples included slides that contained invasive tumour but also prominent lymph nodes. These cases were then often classified as "lymph node", even though a human pathologist would in all cases classify these slides as "tumour tissue", since that is the more clinically relevant class. The second most prevalent reason was mislabelling (27.08%), which describes cases where, after a second review, the ground truth generated by the human pathologist was found to be incorrect. In most of these cases the model actually classified the case correctly, but usually these cases also contained key features of multiple classes, which made establishing a clear ground truth difficult. Misleading features accounted for 18.31% of misclassified cases. This category contains slides with tissue features that were apparently misinterpreted by the model. For example, some cases in this category contained mucosa-associated lymphoid tissue alongside normal intestinal mucosa, which the model apparently interpreted as an invasive tumour, therefore labelling the case "tumour tissue". In other cases, tissue-marking dye used to highlight resection margins was confused for invasive tumour growth. Slide artefacts made up 14.98% of all misclassified cases. These include ink markings, air bubbles, dust, and other contaminants. Staining issues are also included in this category. A small number of
misclassified cases (8.17%) was due to tissue structures the model was not familiar with for a certain class. Among others, this includes cases where normal intestinal mucosa was cut at an angle, leading to tissue architectures that differ from the norm, or tumour cases where the tumour seemed to be surrounded by connective tissue on all sides (Figure 3). In summary, we were able to describe five different types of possible reasons that provide a likely explanation for why each case was misclassified.
To further explain the model's decisions and to corroborate which factors contributed to the misclassification of images, we explored visual explanations by applying gradient-weighted class activation mapping (Grad-CAM) to the images. Grad-CAM overlays a heatmap onto the original image, highlighting individual pixels and regions that were important for the model's classification decision. For example, size and shape appeared to be important in both the accurate classification and the misclassification of biopsy and lymph node images. Grad-CAM revealed the model's ability to detect cancer cell invasion into subepithelial tissue, as well as to identify tissue regions where normal epithelium transitions to invasive cancer. Additionally, Grad-CAM also highlighted some of the model's weaknesses. For example, subtle colour deviations within the tissue could sometimes lead to misclassification. Grad-CAM was also able to visualize when the model misinterpreted certain tissue features as features of a different class (Figure 4). In conclusion, this shows that while our models performed very well, there are still limitations to tissue classification. This is particularly true for borderline cases, where even establishing a clear ground truth can be difficult, given that tissues are usually very complex in their structure and often contain characteristics of multiple classes.

Discussion

A ROLE FOR LOW-RESOLUTION "THUMBNAIL" IMAGE ANALYSIS IN COMPUTATIONAL PATHOLOGY

Computational pathology studies almost exclusively address the classification of WSIs at high magnification. Due to the gigapixel size of images at this magnification, complex pipelines are usually used to tessellate a gigapixel image, run predictions on individual tiles, and aggregate predictions, often with multiple instance learning (MIL).[3] Such approaches have also been used for "search and retrieval" problems in computational pathology.[25]
However, these intricate workflows are highly computationally expensive, and therefore costly. Furthermore, software pipelines should follow the principle of Occam's razor: they should not be unnecessarily complex if a simple workflow is sufficient. Here, we pursued a much simpler approach: we used low-resolution thumbnail images of whole slides to perform the essential task of tissue classification. Surprisingly, even very simple approaches, such as thumbnail-based classification with simple transfer learning or few-shot learning, yielded near-perfect performance. Testing in external cohorts yielded slightly lower performance, for which a partial remedy was the use of SSL to pretrain the models. Additionally, training and testing our models on datasets from multicentre, large-scale clinical trials, with good performance across all cohorts, highlights the robustness and flexibility of our models, even when faced with the variability of real-world data.
Thus, we show that clinically relevant image classification tasks can be efficiently solved at very low resolution. Our approach also has significant potential for implementation in clinical workflows. In a busy histopathology department, our pipeline could be used to presort slides by clinical relevance. For example, in resection cases with a large number of slides, this would allow the pathologist to direct their immediate attention to images of higher diagnostic importance, such as tumour slides.

LOW-RESOLUTION PRESORTING OF SLIDES IN COMPUTATIONAL PATHOLOGY BIOMARKER STUDIES
Our proposed approach could also play a significant role in enhancing complex computational pathology biomarker studies. Currently, tasks such as molecular subtyping[4] pose a substantial challenge, and researchers often need to manually preprocess and select a single tumour-bearing tissue slide for further investigation. This process is time-consuming and reduces the quantity of data that can be utilized in these studies. Our approach seeks to automate the preselection of tumour-bearing tissue slides, thereby significantly expanding the pool of available data. It serves as a supportive tool that improves the efficiency of complex biomarker research by automating this preliminary task. This becomes critically important when dealing with large clinical trials or clinical routine cohorts, which may contain tens of thousands of slides, if not more, that require presorting. Automating this process paves the way for more efficient extraction of computational pathology biomarkers from large datasets.

Our approach has a number of limitations. For instance, the way we created our ground-truth labels was very simplistic. We assumed that every image belongs to exactly one class, but some slides contained two classes, such as a lymph node next to primary tumour tissue. In such cases, tumour was given priority when assigning a class, as this was deemed the more diagnostically relevant. Some of this uncertainty can indeed be attributed to the complex architecture of these tissue samples, with a single slide of a colon resection containing a plethora of different tissue types. Together with the variety of tissue compositions found throughout the slides, this presents a challenging factor that ultimately makes it infeasible for some cases to be assigned a single class label. This limitation is aggravated by the fact that our models are incapable of multilabel classification and will always
output a single label for each slide. Furthermore, artefacts, pen markings, and other slide alterations are technical issues that are frequently present in these data, potentially limiting our approach. Establishing new practices that reduce these technical and human inaccuracies would inherently lead to even more robust performance of such tissue classification models. Nonetheless, we were able to show that, even across different cohorts, the results remained consistent, with only a moderate reduction in classification accuracy. To address some of these limitations we introduced a pretraining step using SSL. In all instances, this approach demonstrated parity with the classic approach, and in certain cases it even slightly improved on the performance of the classic model. Therefore, we would generally recommend the use of an SSL-pretrained model for these kinds of classification tasks.

FUTURE WORK
In the present study we have demonstrated that the curation of large datasets can be accomplished using thumbnail representations of WSIs and a CNN classifier. However, this approach has only been trained and tested in the context of CRC datasets. Therefore, future research must focus on assessing the model's generalizability to other types of cancer. A key direction for subsequent studies would be to validate this model across diverse cohorts of different cancer and tissue types. By doing so, we can confirm the applicability and robustness of this model beyond CRC, enhancing its usability in various areas of cancer research. Additionally, iterations of this model might be useful for identifying the most representative cases within a cohort in order to make clinical trials more efficient, or for multilabel classification of tissue slides. Ultimately, the objective is to evolve this model into a universal tool that can expedite the curation of large datasets across all cancer and tissue types. By achieving this, we can significantly reduce processing time, a major bottleneck in medical AI research, and make data more readily available for the broader research community.

Figure 3. Misclassified cases. (A) Pie chart showing the distribution of reasons for misclassification (standard-approach model deployed on CR07; classification threshold 0.5; number of misclassified cases N = 662). (B) Marking dye confused for invasive tumour cells. (C) Invasive tumour not detected because of staining issues. (D) Large endoscopic resection tissues and prevalent invasive tumours lead to misclassification. (E) Tissue from the same patient as in (D) but classified differently.

Table 1. Clinicopathological features of all cohorts. n, number of cases; SD, standard deviation; WSI, whole-slide image. *Age and gender data refer to the whole FOXTROT cohort; subset data were not available for this study.

© 2024 The Authors. Histopathology published by John Wiley & Sons Ltd., Histopathology, 84, 1139-1153.

CODE AVAILABILITY

All source code and trained models for DL are open source and available at https://github.com/KatherLab/thumbnail-classification. A script for automated thumbnail generation is available at https://github.com/KatherLab/preprocessing-ng.

Table 2. Strong supervision experiments using all the samples. For each approach (one using a standard pretrained ResNet-18 model and one using a ResNet-18 model pretrained on the FOXTROT dataset with a self-supervised learning (SSL) approach), three experiments were conducted: first internal validation on FOXTROT, then external validation on CR07 and DACHS. AUROC and accuracy values are given for each experiment at two different confidence thresholds, alongside the number of cases that were confidently classified by the model in each experiment.

Table 3. Few-shot learning, model pretrained on ImageNet. Few-shot learning experiments were conducted using a minimum of two and a maximum of 64 cases per class for model training: first internal validation on FOXTROT, then external validation on CR07 and DACHS. AUROC and accuracy values are given for each experiment at two different confidence thresholds, alongside the number of cases that were confidently classified by the model in each experiment.

Table 4. Few-shot learning, model pretrained using pathology SSL. Few-shot learning experiments were conducted using a minimum of two and a maximum of 64 cases per class for model training: first internal validation on FOXTROT, then external validation on CR07 and DACHS. AUROC and accuracy values are given for each experiment at two different confidence thresholds, alongside the number of cases that were confidently classified by the model in each experiment.