Automatic coronavirus disease 2019 diagnosis based on chest radiography and deep learning – Success story or dataset bias?

Abstract Purpose Over the last 2 years, the artificial intelligence (AI) community has presented several automatic screening tools for coronavirus disease 2019 (COVID-19) based on chest radiography (CXR), with reported accuracies often well over 90%. However, it has been noted that many of these studies have likely suffered from dataset bias, leading to overly optimistic results. The purpose of this study was to thoroughly investigate to what extent biases have influenced the performance of a range of previously proposed and promising convolutional neural networks (CNNs), and to determine what performance can be expected with current CNNs on a realistic and unbiased dataset. Methods Five CNNs for COVID-19 positive/negative classification were implemented for evaluation, namely VGG19, ResNet50, InceptionV3, DenseNet201, and COVID-Net. To perform both internal and cross-dataset evaluations, four datasets were created. The first dataset, Valencian Region Medical Image Bank (BIMCV), followed strict reverse transcriptase-polymerase chain reaction (RT-PCR) test criteria and was created from a single reliable open access databank, while the second dataset (COVIDxB8) was created through a combination of six online CXR repositories. The third and fourth datasets were created by combining the opposing classes from the BIMCV and COVIDxB8 datasets. To decrease inter-dataset variability, a pre-processing workflow of resizing, normalization, and histogram equalization was applied to all datasets. Classification performance was evaluated on unseen test sets using precision and recall. A qualitative sanity check was performed by inspecting saliency maps displaying the top 5%, 10%, and 20% most salient segments in the input CXRs, to determine whether the CNNs were using relevant information for decision making.
In an additional experiment, and to further investigate the origin of potential dataset bias, all pixel values outside the lungs were set to zero through automatic lung segmentation before training and testing. Results When trained and evaluated on the single online source dataset (BIMCV), the performance of all CNNs is relatively low (precision: 0.65-0.72, recall: 0.59-0.71), but remains relatively consistent during external evaluation (precision: 0.58-0.82, recall: 0.57-0.72). In contrast, when trained and internally evaluated on the combinatory datasets, all CNNs performed well across all metrics (precision: 0.94-1.00, recall: 0.77-1.00). However, when subsequently evaluated cross-dataset, results dropped substantially (precision: 0.10-0.61, recall: 0.04-0.80). For all datasets, saliency maps revealed the CNNs rarely focus on areas inside the lungs for their decision-making. However, even when setting all pixel values outside the lungs to zero, classification performance does not change and dataset bias remains. Conclusions Results in this study confirm that when trained on a combinatory dataset, CNNs tend to learn the origin of the CXRs rather than the presence or absence of disease, a behavior known as short-cut learning. The bias is shown to originate from differences in overall pixel values rather than embedded text or symbols, despite consistent image pre-processing. When trained on a reliable and realistic single-source dataset in which non-lung pixels have been masked, CNNs currently show limited sensitivity (<70%) for COVID-19 infection in CXR, questioning their use as a reliable automatic screening tool.


INTRODUCTION
While vaccination programs are being rolled out, coronavirus disease 2019 (COVID-19) maintains a strong grip on society worldwide. 1 To limit the infection rate and avoid overburdening health care facilities, fast and effective screening and diagnosis remain critical in the fight against the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). 2 Next to reverse transcriptase-polymerase chain reaction (RT-PCR) testing, the current gold standard for diagnostic confirmation, both planar chest radiography (CXR) and computed tomography (CT) have been proposed as diagnostic solutions. [3][4][5][6][7] Although the European Society of Radiology and the European Society of Thoracic Imaging strongly advised against the use of CXR as a first-line diagnostic technique, several early studies found that patients do present with abnormalities in CXR characteristic of COVID-19. [8][9][10] Together with the other benefits of this imaging modality, that is, relatively low cost and radiation dose, wide availability, speed, and portability, these studies have led to the suggestion that CXR might be an ideal candidate for triaging patients presenting to hospitals, especially in epidemic areas. 11 CXR for COVID-19 diagnosis, however, still requires expert radiologists (>10 years of experience) to interpret the images with high specificity, a bottleneck in the workflow that is both time consuming and costly. 12,13 To overcome this issue, the artificial intelligence (AI) community has presented numerous machine learning- and deep learning (DL)-based image analysis tools that are able to automatically differentiate between COVID-19 positive and negative patients based on a single CXR, with reported accuracies and sensitivities often well over 90%. [14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32][33] One of the first such networks was COVID-Net, reaching 93.3% accuracy on the test set of its publicly available dataset termed COVIDx. 34
As large single-hospital CXR datasets of both COVID-19 positive and negative patients are scarce, researchers looking into these DL methods have often made use of a combination of publicly available repositories. [35][36][37] However, this approach can increase the risk of hidden biases that may lead to overly optimistic results. [38][39][40][41][42][43][44] The likelihood of such a bias is particularly high when the data per class originate from different sources, such as different countries, hospitals, or imaging systems. 45,46 In these cases, underlying differences in the image data distributions, due to, for example, differences in image acquisition parameters, post-processing operations, or overall patient characteristics unrelated to COVID-19, might create spurious correlations. Especially when these differences are more obvious than the COVID-19 disease features, they are likely to be exploited by the neural network (NN). This phenomenon is known as short-cut learning and hampers the NNs' generalization capabilities significantly. 47,48 Jabbour et al., for example, showed that NNs can accurately identify patient attributes in CXRs such as sex and age, and that NNs tend to exploit correlations between these attributes and the outcome label when learning to predict a diagnosis, leading to poor performance when such correlations do not hold in the test population. 46 In a study by Kim et al., underlying differences in the dataset distributions of commonly used COVID-19 CXR datasets were visualized through principal component analysis and t-distributed stochastic neighbor embedding. 48 Excellent performance during internal validation and poor performance during external validation showed these differences were likely exploited by the NNs during training. In the same context, but using state-of-the-art techniques in explainable AI, DeGrave et al. also showed that NNs are more likely to rely on confounding factors rather than relevant pathology. 47
The aim of this study was to qualitatively and quantitatively investigate to what extent possible dataset bias has influenced the performance of a range of promising deep convolutional NNs (CNNs) that were previously proposed for COVID-19 diagnosis in CXRs. In addition, through the creation of saliency maps and additional pre-processing, we aimed to define what exactly caused the dataset bias in a widely used COVID-19 CXR dataset, to support the development of bias elimination methods. Finally, we determined what performance can be expected with current CNN architectures on a reliable dataset that carries a low risk of dataset bias and is publicly available.

MATERIALS AND METHODS
Five deep CNNs that were previously proposed for automatic COVID-19 positive/negative classification were implemented for evaluation: (1) VGG19, 17,19,49 (2) ResNet50, 15,50 (3) InceptionV3, [51][52][53] (4) DenseNet201, 54-56 and (5) COVID-Net. 34 As can be seen in Table S1, these CNNs cover a broad range of layers and numbers of trainable parameters, while also differing in topology (e.g., skip connections in ResNet50, parallel connections in InceptionV3, or multiple direct connections from previous layers in DenseNet201). Of these five CNNs, only COVID-Net was specifically designed for COVID-19 detection on CXRs. COVID-Net was developed as an open-source initiative and several versions are publicly available, including a binary (COVID-19 positive/negative) and a multi-class (no pneumonia/non-COVID-19 pneumonia/COVID-19 pneumonia) classification network. 57 Pre-trained models are available online, together with several scripts to pre-process the corresponding dataset and train and test the network. In this study, COVID-Net for binary classification was implemented as provided online without any modifications and the latest pre-trained model weights ('COVID-Net-CXR-2', released on 20 March 2021) were downloaded. CNNs 1-4 were applied as implemented in Keras with the TensorFlow backend. 58,59 All four networks were pre-trained on ImageNet and further trained (all layers) on the datasets described in the next paragraphs. Hyper-parameters were optimized for each dataset using 20% of the training sets as validation, and early stopping based on the validation loss was applied. For reproducibility, a detailed overview of all hyper-parameters is given in Table S1.
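As an illustration, the training setup described above (ImageNet pre-training, fine-tuning of all layers, a 20% validation split, and early stopping on the validation loss) can be sketched in Keras as follows. The input size, optimizer, and learning rate are illustrative assumptions rather than the exact hyper-parameters of Table S1, and dummy arrays stand in for the pre-processed CXR datasets:

```python
import numpy as np
from tensorflow import keras

def build_classifier(weights=None):
    # In practice weights="imagenet" would be passed so training starts
    # from ImageNet pre-trained weights, as described above.
    base = keras.applications.ResNet50(include_top=False, weights=weights,
                                       input_shape=(224, 224, 3), pooling="avg")
    out = keras.layers.Dense(1, activation="sigmoid")(base.output)
    model = keras.Model(base.input, out)
    # All layers remain trainable, so the entire backbone is fine-tuned.
    model.compile(optimizer=keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_classifier()

# Dummy stand-in for a pre-processed CXR training set.
x = np.random.rand(10, 224, 224, 3).astype("float32")
y = np.random.randint(0, 2, size=10)

# 20% of the training data is held out for validation; training stops
# early once the validation loss no longer improves.
stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)
history = model.fit(x, y, validation_split=0.2, epochs=1,
                    batch_size=2, callbacks=[stop], verbose=0)
```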
To perform both internal and cross-dataset (i.e., mimicking external) evaluations to quantify generalizability and to evaluate the influence of multiple sources in a single dataset, four COVID-19 positive/negative datasets were created as illustrated in Figure 1: (1) Valencian Region Medical Image Bank (BIMCV), (2) COVIDxB8, (3) BIMCV+/COVIDx-, and (4) COVIDx+/BIMCV-. The original datasets from which BIMCV and COVIDxB8 were created represent two of the largest publicly available datasets of COVID-19 medical images and are therefore often used in studies investigating DL for automatic COVID-19 diagnosis.
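The pre-processing workflow applied to all four datasets (resizing, intensity normalization, and histogram equalization, as stated in the abstract) can be sketched as follows. This is a minimal NumPy-only illustration; the target size and the nearest-neighbour resizing are assumptions, and a library such as OpenCV or scikit-image would normally handle the resizing and equalization:

```python
import numpy as np

def equalize_histogram(img, n_bins=256):
    """Spread intensities uniformly over [0, 1] via the empirical CDF."""
    hist, _ = np.histogram(img.ravel(), bins=n_bins, range=(0.0, 1.0))
    cdf = hist.cumsum().astype(np.float64)
    cdf /= cdf[-1]                                    # normalize CDF to [0, 1]
    bin_idx = np.clip((img * n_bins).astype(int), 0, n_bins - 1)
    return cdf[bin_idx]

def preprocess(cxr, size=224):
    cxr = cxr.astype(np.float64)
    cxr = (cxr - cxr.min()) / (cxr.max() - cxr.min() + 1e-8)  # min-max normalize
    # Nearest-neighbour resize via index sampling (an assumption;
    # bilinear interpolation is more common in practice).
    rows = np.linspace(0, cxr.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, cxr.shape[1] - 1, size).astype(int)
    cxr = cxr[np.ix_(rows, cols)]
    return equalize_histogram(cxr)

img = np.random.rand(512, 480) * 4095   # synthetic 12-bit radiograph
out = preprocess(img)
```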
BIMCV was created from a single online source of CXRs, namely the Medical Imaging Databank in the BIMCV-COVID-19 dataset, which contains both CXR and CT data. 35 CXRs in this dataset originate from 11 medical centers in the Valencian region (Spain).
Lastly, to increase the understanding of the CNNs' decision-making, saliency maps were created that link the CNN classification outcome to the areas in the input image that had the most impact on that outcome. [61][62][63][64] Through this attribution method, a qualitative sanity check can be performed by evaluating whether the high-impact areas correspond to relevant areas inside the lung, as opposed to improper information in the images (e.g., areas outside the lung, embedded symbols, etc.). XRAI, a region-based attribution method built on Integrated Gradients, was applied, and saliency maps showing the most salient segments (top 5%, 10%, and 20%) of each CXR were visualized and qualitatively evaluated. 65,66 Further, in an additional experiment, the pixel values outside the lungs were set to zero on all CXRs before training and testing, through automatic lung segmentation with a U-Net CNN. As such, the CNNs were forced to use only relevant parts of the anatomy and could not rely on embedded text or symbols, which are typically present outside the lungs. This served to further investigate the source of possible dataset bias and a potential solution.

RESULTS
Similar results to those obtained with COVIDxB8 were obtained for all CNNs when trained on one of the other combinatory datasets (i.e., BIMCV+/COVIDx- and COVIDx+/BIMCV-). Moreover, it can be observed that when a CNN is trained on BIMCV+/COVIDx- and evaluated cross-dataset on COVIDx+/BIMCV- (or vice versa), that is, when the origin of the two classes is switched between training and testing, CNNs perform worse than if classification had occurred at random, with sensitivities ranging only from 0.04 to 0.08 (indicated in bold in Table 1).
These results strongly suggest that when trained on a dataset in which the classes originate from different (online) sources, the CNNs learn confounding factors related to the data source rather than medically relevant pathology in the CXRs. Table S2 lists the classification precision obtained with CNNs 1-4 when 90% sensitivity for COVID-19 positive detection is required. A similar trend as with the previous metrics can be observed across the different datasets and for all CNNs, with large discrepancies between the internal and cross-dataset evaluations.

TABLE 1 Coronavirus disease 2019 (COVID-19) positive precision/recall obtained on the unseen test set of each dataset in both the internal (grey shading) and cross-dataset evaluation. BIMCV+/COVIDx- and COVIDx+/BIMCV- were created by combining the opposing classes from the Valencian Region Medical Image Bank (BIMCV) and COVIDxB8 datasets. Numbers in bold indicate the particularly poor performance when the origin of the two classes is reversed between training and testing.

When CXR pixels outside the lungs are masked before training and testing, the classification performance of CNNs 1-4 trained on COVIDxB8 does not change (Table 2 vs. Table 1). This indicates the dataset bias remains and is due to differences in overall intensities (e.g., contrast, noise, etc.) rather than embedded symbols or text. For BIMCV and all evaluated CNNs, masking irrelevant parts of the CXRs led to more consistent results between internal and cross-dataset evaluations.
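The operating point behind Table S2, i.e., the precision reached once the decision threshold is lowered until 90% COVID-19 sensitivity is met, can be computed from predicted scores as sketched below. This is a NumPy-only illustration; the function and variable names are ours, not from the study:

```python
import numpy as np

def precision_at_sensitivity(y_true, scores, target_recall=0.90):
    """Precision at the highest threshold whose recall meets the target."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    order = np.argsort(-scores)              # sort by descending score
    tp = np.cumsum(y_true[order] == 1)       # true positives at each cut-off
    fp = np.cumsum(y_true[order] == 0)       # false positives at each cut-off
    recall = tp / max(y_true.sum(), 1)
    precision = tp / (tp + fp)
    # First operating point whose recall reaches the target (index 0 is
    # returned if no threshold achieves it, an edge case to guard in practice).
    idx = np.argmax(recall >= target_recall)
    return precision[idx]

# Toy labels and scores for ten test CXRs (1 = COVID-19 positive).
y = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
s = [0.95, 0.9, 0.8, 0.7, 0.65, 0.6, 0.55, 0.4, 0.3, 0.2]
p = precision_at_sensitivity(y, s)
```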

DISCUSSION
The evaluation of five distinct CNNs that were previously proposed for automatic COVID-19 diagnosis on CXR showed quantitative results that were highly dependent on the applied dataset (Table 1). Moreover, all CNNs failed a qualitative sanity check on all datasets, despite consistent performance between internal and external evaluation when trained on the single-source dataset (Figures 2 and 3). The five CNNs evaluated in this study were selected to represent a broad range of trainable parameters, numbers of layers, and topologies (Table S1). However, before discussing the results it should be noted that there is no certainty that the results obtained with these models are representative of all NN architectures. Similarly, the two datasets selected for this study, while representing some of the most used publicly available datasets on the topic, might not be representative of all datasets.
Quantitatively, all CNNs showed similar performance (Table 1). However, an extensive evaluation of COVID-Net is limited as only pre-trained models are available. While the network performs well when trained and tested on the COVIDxB8 dataset, the quality of the latter is questionable. Large discrepancies between internal and cross-dataset evaluations, seen with each of the five CNNs, indicate the CNNs are able to learn other patterns in the dataset that distinguish the two classes but are not related to the presence of COVID-19 infection. These results persist even when pixels outside the lungs are masked before training and testing, despite an identical pre-processing workflow in which the image intensities are normalized and spread out homogeneously over a fixed intensity interval through histogram equalization.
Tartaglione et al. previously warned of possible hidden biases when combining different datasets, noting that NNs might find spurious correlations in differing imaging parameters between datasets instead of looking at the actual disease. 40 The latter was also confirmed by Maguolo and Nanni, who showed deep NNs could still identify the origin of the CXRs even when the lung regions were excluded from the images. 67 Of interest, the dataset of CXRs used by Maguolo and Nanni is included in COVIDxB8. The presence of a hidden bias is further confirmed by the cross-dataset evaluation of datasets 3 and 4 in this study, showing that the CNNs continue to classify images according to the dataset they belong to, instead of the presence or absence of disease (Table 1, numbers in bold). A qualitative sanity check through the use of saliency maps also confirms the CNNs' decision-making is largely based on regions outside the lungs, including but not limited to embedded text and/or symbols, instead of COVID-19 lesions or healthy lung tissue (Figures 2 and 3).
Although the authors of COVID-Net have also used an explainability approach and their qualitative results indicated COVID-Net often used relevant areas in the CXR for decision making, results in the current study indicate a quantitative external validation remains crucial. 68,69 It is therefore recommended to limit the use of COVIDx and other combinatory datasets in their current form, pending novel pre-processing techniques that are able to robustly eliminate dataset bias, and to interpret the results of models trained on such datasets with care. 70 By creating a relatively large dataset from a single online source, the aim of the BIMCV dataset was to eliminate this bias and obtain more realistic results. Further, by adhering to a strict RT-PCR ground truth for each CXR, a dataset with highly reliable labels was created. However, it has to be taken into account that the RT-PCR test has high specificity but moderate sensitivity, so an unknown percentage of false negatives might still be present in the final dataset. 71,72 The latter represents an almost unavoidable obstacle in the (semi-)automatic creation of the very large COVID-19 datasets required for DL, unless a reliable amount of additional and structured metadata is available on the patient's symptoms and follow-up tests. 73 Further, by adhering only to RT-PCR criteria, the BIMCV dataset likely contains a percentage of mild COVID-19 positive cases with limited symptoms and no radiological signs. 74 This might partly explain the lower COVID-19 classification performance obtained in this study on BIMCV compared to similar studies on other datasets. However, we believe BIMCV represents a clinically realistic scenario when applying CXR for screening and diagnosis, as not all patients will present with severe COVID-19 pneumonia. This, however, also implies that automatic COVID-19 diagnosis using CXR and DL has limited sensitivity (range: 0.59-0.71), in combination with low specificity (range: 0.56-0.76).
Furthermore, a qualitative sanity check revealed the NNs do not focus on relevant information in the CXRs. These results indicate that a quantitative external validation alone might not be sufficient to ensure a NN relies on medically relevant pathology, as also concluded by DeGrave et al. 47 By segmenting the lung regions as an additional pre-processing step before feeding the CXRs to the classification networks, CNNs were forced to look at relevant parts of the anatomy only and generalizability improved slightly. However, COVID-19 sensitivity and precision remained below 70%. As the BIMCV dataset is publicly available, the pre-processing steps mentioned in this study, including lung segmentation, can be followed to create a relatively large and reliable dataset with a low risk of bias for further CNN development.
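The masking step referred to above, setting all non-lung pixels to zero given a binary lung mask (obtained in the study via U-Net segmentation), amounts to a simple element-wise operation; the mask below is a synthetic placeholder for an actual segmentation:

```python
import numpy as np

def mask_non_lung(cxr, lung_mask):
    """Zero out every pixel outside the (binary) lung mask."""
    assert cxr.shape == lung_mask.shape
    return np.where(lung_mask > 0, cxr, 0.0)

cxr = np.random.rand(224, 224)            # stand-in for a pre-processed CXR
mask = np.zeros((224, 224))
mask[60:180, 40:190] = 1                  # placeholder lung region
masked = mask_non_lung(cxr, mask)
```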
Improvements can be expected through a number of approaches, such as the optimization of NN architectures and/or the incorporation of clinical patient features such as COVID-19-specific symptoms in the final NN decision making. 75,76 Additional improvements can be expected from the availability of more standardized, large-scale, and high-quality datasets, provided in medical image standards such as DICOM, so that differences in overall intensity values (e.g., contrast, noise, etc.) can be eliminated. In addition, novel data augmentation techniques, such as those using generative adversarial networks to simulate pathology in existing CXRs or render completely synthetic CXRs, could create larger and more balanced datasets. [77][78][79] Another approach is presented by Ahmed et al., who propose fine-tuning on unseen data to improve performance at a new site. 80

CONCLUSIONS
Over the last 2 years, the AI community has presented several automatic screening tools for COVID-19 based on CXR, with reported accuracies often well over 90%. However, it has been noted that many of these studies have likely suffered from dataset bias, leading to overly optimistic results. This study confirms that when trained on a combinatory dataset, CNNs tend to learn the origin of the CXRs rather than the presence or absence of disease, a behavior known as short-cut learning. The bias is shown to originate from differences in overall pixel values rather than embedded text or symbols, despite consistent image pre-processing. When trained on a reliable, and realistic single-source dataset in which non-lung pixels have been masked, CNNs currently show limited sensitivity (<70%) for COVID-19 infection in CXR.

ACKNOWLEDGMENT
This research was funded by Health-Holland, Top Sector Life Sciences & Health (grant TKI-LSH-T2019-SmART-DETeCT).

CONFLICT OF INTEREST
The authors declare that they have no conflict of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available in the following repositories in the public domain: BIMCV-COVID19: https://bimcv.cipf.es/bimcv-projects/bimcv-covid19/ COVIDx: https://github.com/lindawangg/COVID-Net/blob/master/docs/COVIDx.md