Deep learning based digital cell profiles for risk stratification of urine cytology images

Urine cytology is a test for the detection of high-grade bladder cancer. In clinical practice, a pathologist manually scans the sample under the microscope to locate atypical and malignant cells and then assesses their morphology to make a diagnosis. Accurate identification of atypical and malignant cells in urine cytology is a challenging task and an essential part of differentiating between diagnoses carrying a low and a high risk of malignancy. Computer-assisted identification of malignancy in urine cytology can complement the clinician's assessment for treatment management and for advising further tests. In this study, we present a method for identifying atypical and malignant cells, followed by their profiling to predict the risk of diagnosis automatically. For cell detection and classification, we employed two different deep learning-based approaches. Based on the best-performing network's predictions at the cell level, we identified low-risk and high-risk cases using the count of atypical cells and the total count of atypical and malignant cells. The area under the receiver operating characteristic (ROC) curve shows that the total count of atypical and malignant cells is a better predictor of the diagnosis than the count of malignant cells alone: we obtained areas under the ROC curve of 0.81 with the count of malignant cells and 0.83 with the total count of atypical and malignant cells. Our experiments also demonstrate that the digital risk could be a better predictor of the final histopathology-based diagnosis. We also analyzed the variability in annotations at both the cell and the whole-slide-image level and explored possible inherent rationales behind this variability.


| INTRODUCTION
Bladder cancer is the ninth most commonly occurring malignancy globally, with around 430,000 new cases reported in 2012 [1]. Urine cytology is an important tool for detecting malignancies of the urinary tract, such as bladder cancer. It is widely used to identify high-grade urothelial cancer (HGUC) but is not recommended for low-grade carcinoma owing to its low sensitivity for such lesions. In clinical practice, pathologists observe cytology slides under the microscope and identify atypical and malignant cells.
Based on the morphology of these cells, a diagnosis is made, informing treatment decisions.
Unlike histology, the digital adoption of urine cytology has been impeded by scanners' limited ability to perform z-stacking, along with other limitations specific to cytology. Tissue material for histology has a relatively uniform thickness, whereas cytology material is less evenly distributed, with cell clusters of variable thickness in a 3D configuration. For this reason, pathologists frequently need to focus on different planes to view all the cells. It has been demonstrated that the availability of more than one focal plane on digital cytology slides helps with diagnostic interpretation [2]. Z-stacking, a built-in capability of the microscope, enables the user to examine the sample at different focal planes. With the advancement of whole slide scanners, several vendors now provide imaging systems capable of z-stacking, which has encouraged pathologists to scrutinize digital cytology in clinical practice. However, this comes at the cost of much larger image files and longer scanning times [3].
As in histology, the diagnosis of cytology cases suffers from high inter- and intraobserver variability [4]. In addition to variability in the assessment of urine cytology, different terms for the same entities are used at both the individual-pathologist and the institutional level. This led to the development of The Paris System (TPS) to provide a consistent and reliable diagnostic tool. An international working group, comprising expert cytopathologists, urologists, and surgical pathologists, provided criteria for reporting different diagnostic categories, including recommendations for HGUC, which is the main purpose of urine cytology. TPS was officially released in 2016 [5] and is now accepted worldwide. It has led to significant improvement in the assessment of urine cytology specimens, with adequate precision for negative cases. However, studies [6][7][8][9] on interobserver variability demonstrated poor agreement for the other categories. In [9], five cytopathologists independently reviewing 149 cases reported differing distributions of categories. This interobserver variation makes the automated diagnosis of cytology samples challenging.
In a clinical setting, a cytology specimen on a glass slide is examined manually under a microscope. Like histology samples, urine cytology slides can be visualized on a computer screen after digitization, a practice so far adopted by only a few laboratories. Wider uptake of digital cytology can encourage assisted assessment of specimens with computer-generated results, spurring the development of quantitative analysis algorithms and enabling clinicians to obtain objective and reproducible outcomes.
The main goal of this study is to investigate an automated alternative for risk stratification of urine cytology slides. The status quo, based on subjective visual analysis, is prone to human error and exhibits large inter- and intraobserver variability. Therefore, there is a need to investigate its limitations with respect to the intrinsic difficulty of the problem (in both diagnostic and technical terms). Our contribution in this paper is fivefold. First, we collected cell-level annotations in an iterative way to improve the generalizability of the model; TPS criteria were used by the expert pathologists for labeling. Second, we explored two different approaches for cell detection and classification and employed the better one for whole slide image (WSI) labeling. Third, we present a cell count-based approach for identifying high-risk cases. Fourth, we investigated the interobserver agreement at the WSI level and the intraobserver variability at the cell level. Lastly, we investigated the cytopathology-based risk category and our digital risk labeling in correlation with the "gold standard" histopathology-based diagnosis.
In the remainder of this section, we review previous work on cytology image analysis.
In Section 2, we describe the details of our dataset and our methodology for cell segmentation, detection, and classification. In Section 3, we present our results at both the cell and WSI level. In Section 4, we discuss our findings, while Section 5 concludes this study.

| Related work
In the literature, very few studies can be found on the automatic analysis of cytology images in comparison to the work on histology image analysis. Recently, there has been some work on cell detection, classification, and segmentation in cytology images. GoogLeNet was employed in [10], and Zhang et al. [11] presented a simple convolutional neural network (CNN) to classify cervical cells in a Pap-smear cytology image without any prior cell segmentation. Their training set comprised fixed-size patches with the nucleus located at the center of each patch, meaning the network was trained on patches containing partial cell content. In another study [12], a simple CNN was used to classify cells in nasal cytology into one of seven classes. To train the network with patches containing whole-cell content, the authors performed cell segmentation via the Otsu algorithm followed by morphological operations and the watershed algorithm. To overcome the problem of unbalanced classes, they opted for random majority undersampling. Wu et al. [13] employed an AlexNet-based network to identify different types of ovarian cancer from cytological images captured from different parts of the tissue sample; these images were then divided and resized into smaller patches for training.
One recent study [14], which integrates deep learning and morphometric approaches, focuses on automating TPS for the analysis of urine cytology images. Deep learning is used to assign atypia score to a given cell while a morphometric approach computes the nucleus to cytoplasmic ratio. They employed thresholding to segment cellular content, followed by connected component analysis for extracting cell patches. Based on their cell classification approach, they have proposed a condensed grid format for an image reconstruction which is less cellular and smaller in size in comparison to the original image.
The authors also illustrated the prediction of high-risk cases based on cutoffs for the cell morphological features they employed.
Sanghvi et al. [15] presented a deep learning-based pipeline for classifying urine cytology images into five TPS categories which can further be divided into low and high-risk classes. QuPath was used to detect cells in a WSI and a patch of fixed size was extracted from the center.
The authors employed both cell-level and slide-level features for WSI classification and validated it using a large cohort. To the best of our knowledge, [14,15] are the only studies on risk stratification.
There has been some effort in separating overlapping cells in both single-plane and z-stacked cytology images, including but not limited to [16][17][18][19] and [20]. In our study, we perform segmentation to extract both individual cells and clusters of cells, ensuring that the whole cell or cluster is captured inside the bounding box. Therefore, separating overlapping cells is not necessary for our approach.

| Specimen collection, digitization, and labeled data preparation
The cytology slides used in this study and the associated clinical data were obtained from the University Hospitals Coventry and Warwickshire (UHCW) NHS Trust in Coventry, UK. The dataset was provided after deidentification, and informed consent was obtained from the patients. Each slide was labeled as normal, inflammatory, cytological atypia (CA), atypia suspicious for malignancy (ASM), or transitional cell carcinoma (TCC). In this paper, we use the term "reference" for the diagnostic information obtained from the UHCW; it does not necessarily mean that the label was decided by a single pathologist. All the slides were prepared using a liquid-based cytology method (ThinPrep) and were scanned at 0.275 µm per pixel, with a maximum resolution of 40×. In total, we obtained 398 slides, comprising 243 normal, 13 inflammatory, 76 CA, 38 ASM, and 28 TCC.
These slides were scanned using an Omnyx VL120 scanner to form a multilayered pyramid enabling the user to visualize the slide at different resolutions.

| Creation of labeled dataset
We obtained cell-level annotations from an experienced pathologist and a recently trained pathologist. Both pathologists followed TPS criteria for labeling cells as normal, atypical, or malignant urothelial cells. Other cell types present in urine (e.g., squamous, inflammatory) were also annotated. Degenerated cells and cells that the pathologists were uncertain about were annotated as "others." Variations in annotations affect the performance of a trained classifier, so we performed an interobserver variability analysis between the two pathologists to identify the highly concordant classes. The same set of visual fields was presented to both pathologists for independent annotation. High concordance scores were observed for the normal, squamous, and inflammatory classes. Considering the variability in the rest of the classes, we

| Balancing of labeled dataset
The dataset obtained after the initial annotations suffered from an unbalanced class distribution. With unbalanced classes, a classifier tends to perform poorly on the minority classes because it sees too few examples of them. To balance the distribution in the training set, we employed an oversampling technique known as the synthetic minority over-sampling technique (SMOTE) [21]. In our initial dataset, it was atypia and debris
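The idea behind SMOTE — synthesizing new minority-class samples by interpolating between a sample and one of its nearest minority-class neighbours in feature space — can be sketched as follows. This is a minimal numpy illustration of the technique, not the implementation used in this study:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=None):
    """Minimal SMOTE sketch: synthesize n_new minority-class samples by
    interpolating between a randomly chosen sample and one of its k nearest
    minority-class neighbours (X_min: feature vectors, one row per sample)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per sample
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                    # random minority sample
        j = nn[i, rng.integers(k)]             # one of its neighbours
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.stack(synthetic)
```

Each synthetic point lies on the line segment between a real minority sample and one of its neighbours, so the new samples stay inside the minority-class region of feature space.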

| ROI extraction from whole slide image
In histology slides, a relatively small region of the slide contains tissue; to reduce computation time, the tissue region is identified so that the white background is not processed. For histology images, the background can be excluded by thresholding at a low resolution. As in histology images, the cellular content of urine WSIs is confined to a limited portion of the slide. However, unlike in histology, thresholding at low resolution would omit cells in WSIs containing few cells.
In Figure 1(A), an example urine cytology slide from our dataset is shown at a low resolution. The area inside the two fiducial marks contains cells, while the remaining area is noncellular. Hence, the region outside these two fiducials should be excluded from processing to reduce computation time. To achieve this, we adopt Otsu thresholding [22], which determines a threshold value by maximizing the interclass intensity variance. Specifically, we first convert the RGB image to grayscale and then estimate an optimal threshold value using the Otsu algorithm. A number of other objects, such as text on the slide and other artifacts, were identified at this threshold value; these were excluded using an area-based threshold. The resulting ROI mask, computed at a resolution level of 5×, is shown in Figure 1(A).
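The ROI-masking steps above (grayscale conversion, Otsu cutoff, area-based artifact removal) can be sketched as follows. This is an illustrative sketch: `min_area` is an assumed cutoff, not the value used in the study.

```python
import numpy as np
from scipy import ndimage

def roi_mask(rgb, min_area=500, bins=256):
    """Foreground mask for a low-resolution slide thumbnail: Otsu threshold on
    the grayscale image, then an area filter to drop small objects such as
    slide text and artifacts. rgb: float image with values in [0, 1]."""
    gray = rgb.mean(axis=-1)
    # Otsu's method: pick the cutoff maximizing between-class variance
    hist, edges = np.histogram(gray, bins=bins)
    mids = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)
    w1 = w0[-1] - w0
    m0 = np.cumsum(hist * mids)
    mu0 = m0 / np.maximum(w0, 1)
    mu1 = (m0[-1] - m0) / np.maximum(w1, 1)
    t = mids[np.argmax(w0 * w1 * (mu0 - mu1) ** 2)]
    fg = gray < t                        # cellular material is darker than background
    # area-based removal of small connected components
    lab, n = ndimage.label(fg)
    areas = ndimage.sum(fg, lab, index=np.arange(1, n + 1))
    keep_labels = np.flatnonzero(areas >= min_area) + 1
    return np.isin(lab, keep_labels)
```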

| Cell segmentation
To identify candidate cells, we separated the cellular content from the background by thresholding. We selected a global cutoff using Otsu thresholding, resulting in a segmentation map of individual cells and cell clusters. We followed a simple process for obtaining this value, explained in Algorithm 1. To find an optimal threshold value, the image should contain cells representing the whole population. We therefore employed k-medoid clustering [23] to select exemplary cell patches from each class, setting k = 20 to obtain 20 exemplar patches per class. The sample closest to the medoid of each cluster was added to the exemplar set. Using these exemplar patches, a large synthetic image was generated by randomly placing the patches on a plain background image retrieved from one of the WSIs. This image was then converted to HSV, and the saturation channel was used to find the threshold value via Otsu thresholding. A generated segmentation map for an example visual field is shown in Supplementary Figure 3.
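The core of the Algorithm 1 idea — tiling exemplar cell patches onto a plain background and running Otsu on the HSV saturation channel, where cellular material is more saturated than the background — can be sketched as follows. The random placement scheme and seed are illustrative assumptions:

```python
import numpy as np

def saturation_cutoff(exemplar_patches, background, seed=0, bins=256):
    """Sketch of the Algorithm 1 idea: paste exemplar cell patches onto a
    plain background image, then find a global cutoff via Otsu on the HSV
    saturation channel. Images are floats in [0, 1]."""
    canvas = background.copy()
    h, w = canvas.shape[:2]
    rng = np.random.default_rng(seed)
    for p in exemplar_patches:           # random (possibly overlapping) placement
        ph, pw = p.shape[:2]
        y, x = rng.integers(h - ph + 1), rng.integers(w - pw + 1)
        canvas[y:y + ph, x:x + pw] = p
    cmax, cmin = canvas.max(axis=-1), canvas.min(axis=-1)
    sat = (cmax - cmin) / np.maximum(cmax, 1e-7)   # HSV saturation channel
    # Otsu's method on the saturation values
    hist, edges = np.histogram(sat, bins=bins)
    mids = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)
    w1 = w0[-1] - w0
    m0 = np.cumsum(hist * mids)
    mu0 = m0 / np.maximum(w0, 1)
    mu1 = (m0[-1] - m0) / np.maximum(w1, 1)
    return mids[np.argmax(w0 * w1 * (mu0 - mu1) ** 2)]
```

Pixels whose saturation exceeds the returned cutoff are treated as cellular content when segmenting a visual field.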

| Cell detection and classification
In our study, we applied two different approaches for identifying different types of cells. Both approaches first detect the candidate cells, either by thresholding or using a CNN; the candidate cells are then classified using different CNNs.

| Our approach
Training data preparation
The annotations were obtained at the WSI level, and patches of different sizes were extracted from the images depending on the size of the candidate cells at 40× resolution. For cell clusters, the whole region enclosed by the polygon or rectangle was extracted, while individual cells, marked with a dot placed near the center of the cell, were captured differently. For a given dot, a cell segmentation mask was generated for a 500 × 500 patch with the dot at its center, followed by connected component analysis. A component having the dot inside it or on its boundary was considered a candidate cell. A patch capturing the whole candidate cell was extracted and saved to disk, along with its label, as input to the classification network.
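The dot-to-patch extraction step can be sketched as follows; the `pad` margin and the `None` fallback for dots falling on background are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from scipy import ndimage

def patch_from_dot(mask, image, dot, pad=4):
    """Given a binary cell-segmentation mask and a dot annotation (row, col),
    crop the image around the connected component containing the dot, with a
    small margin so the whole cell fits inside the bounding box."""
    lab, _ = ndimage.label(mask)
    comp = lab[dot]
    if comp == 0:                        # dot fell on background
        return None
    sl_y, sl_x = ndimage.find_objects(lab)[comp - 1]
    y0, y1 = max(sl_y.start - pad, 0), min(sl_y.stop + pad, mask.shape[0])
    x0, x1 = max(sl_x.start - pad, 0), min(sl_x.stop + pad, mask.shape[1])
    return image[y0:y1, x0:x1]
```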

Methodology details
In our approach, we applied global thresholding to segment the candidate cells, as explained in Section 2.3. The generated mask was further processed with hole filling and area-based object removal to suppress artifacts. Connected component analysis was performed to compute a bounding box for each object identified in the mask; the bounding box was then used to collect input data for the classification network.
For classification, we employed Xception, an extension of the Inception network [24] with depthwise separable convolution operations.

| RetinaNet detection and classification
Training data preparation
First, we randomly extracted a background image of size 5000 × 5000 from one of the WSIs; the cell patches used in our previous approach were then randomly placed on it. Background white patches were excluded while training this network.

Methodology details
In our second approach, we employ an object identification method for simultaneous detection and classification of cells. There are a number of one-stage and two-stage object detectors, including but not limited to [25][26][27][28][29][30][31][32]. We use RetinaNet [32], a one-stage detector that has been shown to perform well in terms of both speed and accuracy. One-stage detectors are faster than two-stage detectors but typically perform worse owing to the class-imbalance problem.
In [32], the class imbalance problem is tackled using a novel focal loss.
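For reference, the binary form of the focal loss can be written in a few lines of numpy; `gamma = 2` and `alpha = 0.25` are the defaults suggested in [32]:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss from RetinaNet [32]: the (1 - pt)**gamma factor
    down-weights well-classified examples so the many easy background
    anchors do not dominate training.
    p: predicted foreground probabilities, y: binary labels (0 or 1)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y == 1, p, 1 - p)             # probability of the true class
    a = np.where(y == 1, alpha, 1 - alpha)      # class-balance weight
    return -(a * (1 - pt) ** gamma * np.log(pt)).mean()
```

With `gamma = 0` and `alpha = 0.5`, this reduces to half the ordinary binary cross-entropy; increasing `gamma` shrinks the loss contribution of confident predictions.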
We used ResNet as the backbone network in our experiments and used a publicly available RetinaNet implementation (https://github.com/fizyr/keras-retinanet).

| WSI-level classification
The clinical data used in this study comprise TPS categories assigned by our cytopathologists to each WSI of a cytology slide. The ground truth (GT) risk-based labels are derived from the relative risk associated with the categories outlined in [4]. This risk is defined in relation to the extent of follow-up needed, segregating the cases with a high risk of malignancy for more aggressive follow-up. We used the stated percentage of risk to generate the GT information for classifying samples into low- and high-risk cases: cases with a risk of less than 50% were placed in the low-risk class, and cases with a risk higher than 50% in the high-risk class. The low-risk class comprises normal, inflammatory, and CA cases, while the high-risk class contains ASM and TCC cases. Some images in our dataset were not scanned properly and were out of focus; we excluded these by setting a threshold on the number of all identified cells (excluding debris) relative to the count of cells predicted as debris. Using our system, we stratified the remaining cases by the count of atypical and malignant cells.
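The count-based stratification can be evaluated with a rank-based AUC, as in the following sketch. The per-slide counts below are purely hypothetical and only illustrate the mechanics:

```python
import numpy as np

def auc_from_counts(counts, labels):
    """AUC of a count-based score (e.g. atypical + malignant cells per WSI)
    against binary low/high-risk labels, via the Mann-Whitney U statistic:
    the fraction of (high-risk, low-risk) pairs ranked correctly."""
    counts, labels = np.asarray(counts, float), np.asarray(labels)
    pos, neg = counts[labels == 1], counts[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()   # ties count as half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# hypothetical per-slide cell profile (NOT data from the study)
atypical  = np.array([2, 5, 1, 30, 12, 0, 25, 3])
malignant = np.array([0, 1, 0, 15,  8, 0, 10, 0])
high_risk = np.array([0, 0, 0,  1,  1, 0,  1, 0])   # 1 = ASM/TCC
auc_total = auc_from_counts(atypical + malignant, high_risk)
```

The same function applied to `malignant` alone versus `atypical + malignant` reproduces the kind of comparison reported in this section (0.81 vs. 0.83 on our data).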
We also conducted some additional experiments with different cell profiles, which are listed in the Supplementary Material document.

| Cell-level classification
We evaluated the results obtained with Xception and RetinaNet using commonly used measures, alongside results from several other CNNs, that is, VGG, MobileNet, Inception, and ResNet. All these networks were initialized with weights pre-trained on ImageNet. The area under the curve is found to be 0.99.

| WSI-level classification/risk assessment
The UMAP projection of the counts of all seven categories of cells is shown in Figure 3. We also verified the network predictions for some benign cases for which the number of malignant cells predicted by the network was greater than 10. We found that some of these cases contained reactive normal cells and cells with fluffy cytoplasm. This could be improved by adding such challenging cases to the training set.
In this study, we demonstrated the promise of an automated risk stratification method. There are some limitations of the proposed method related to how the data were obtained. We

| Annotation variability
We sourced cell-level annotations from two pathologists. Inconsistency in their annotations can undermine the performance of the model, since the model will tend to learn from the inconsistent labels. To probe the inconsistency in the labeled dataset, we randomly selected some cells from our validation set and asked the expert pathologist to reannotate them. We selected these cells from the classes of greatest concern: normal, atypia, and malignant. The atypia and malignant classes were selected because they are important for making a diagnosis; the normal class was selected because it was most often misclassified as atypia by the network. The variability in the annotations of the same pathologist is shown in Table 1. In addition to slide quality and the lack of multiple focal planes, the intraobserver variability could be due to the pathologists' limited experience with digital slides for urine cytology. Intraobserver variability is a recognized issue in cytology; however, sourcing labels by consensus among several pathologists would reduce this variability and improve the performance of the model.

| Performance of RetinaNet
There is a large gap between the performance of RetinaNet with a ResNet backbone and that of cell segmentation followed by a ResNet classifier. This is partly due to a limitation of the RetinaNet detector, which misses several cells. Additionally, the detector produced many bounding boxes for a single candidate object; choosing only the bounding box whose predicted label has the highest probability further increases the number of missed cells. In our validation set of 5175 cell samples, 68 cells were missed when no detected object was ignored, whereas selecting only predictions with a probability greater than 50% resulted in 692 missed cells. In contrast, the threshold-based segmentation does not miss any cells, although it may fail to segment the whole cell, particularly squamous cells.
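A common way to reduce duplicate boxes for a single object is greedy non-maximum suppression (NMS). The sketch below is a generic illustration of the technique, not the exact post-processing used in our experiments:

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    remaining box and drop boxes whose IoU with it exceeds iou_thr.
    boxes: (N, 4) as (x0, y0, x1, y1); returns indices of kept boxes."""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    order = np.argsort(-scores)
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        # intersection of the best box with the remaining boxes
        x0 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y0 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x1 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y1 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
        area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (area(boxes[i:i + 1])[0] + area(boxes[order[1:]]) - inter)
        order = order[1:][iou <= iou_thr]
    return keep
```

Unlike keeping only the single highest-probability box per object, NMS retains one box per distinct object, which would reduce the missed-cell problem described above.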

| Correlation between cytology and histology
We also studied the correlation between the cytopathology-based and the histopathology-based diagnosis. We obtained the histopathology diagnosis for 48 cases along with their cytology and histology reports. These cases comprise 26 CA and 37 ASM, diagnosed using the cytology slides. We hypothesized that the cases for which the network predicted a greater number of atypical and malignant cells would be diagnosed as malignant on histology. We observed a trend of association between cell count and histopathology-based diagnosis, as shown in Supplementary Figure 7. We compared the results of the cytopathology-based risk category and our digital risk labeling against the "gold standard" histopathology-based diagnosis. The confusion matrix for manual cytopathology-based risk versus manual histopathology-based diagnosis is shown in Supplementary Table 3. As can be seen in Figure 5, the digital risk could be considered a better predictor of the histopathology-based diagnosis; however, this needs to be validated with a large-scale multicenter study. To study this further, we looked into the cytology and histology reports of some of these cases to understand the grounds for the possible discrepancies between the cytology and histology diagnoses. We came up with the following rationales for the discrepancies: (1)   of which are already considered as contentious and borderline, rather than between malignant and normal cells. This is similar to the findings reported in [6][7][8]. Considering this variation, the ROC obtained in this study could vary on testing the proposed method with WSI labels obtained from a different pathologist. The interobserver variation in labeling cells and WSIs makes the automated diagnosis of cytology samples challenging.

| CONCLUSION
In this study, we found that the count of atypical and malignant cells is more robust in discriminating between low and high-risk cases as

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.