Evaluation of deep learning for COVID‐19 diagnosis: Impact of image dataset organization

Abstract

Introduction: Coronavirus disease 2019 (COVID‐19) has spread all over the world, showing high transmissibility. Many studies have proposed deep learning‐based diagnostic methods using chest X‐ray images, focusing on performance improvement. In reviewing them, this study noticed that evaluation results might be influenced by dataset organization. Therefore, this study examined whether high performance values actually demonstrate the potential for clinical application.

Methods: This study selected chest X‐ray image databases that have been widely applied in previous studies. One database includes images of COVID‐19, while the others consist of normal and pneumonia images. A COVID‐19 classification model was then designed, trained on diverse database compositions, and evaluated using confusion matrix‐based metrics. Each database was also analyzed by graphical representation methods.

Results: The performance differed significantly according to dataset composition. Overall, higher performance was observed on datasets organized with a different database for each class than on datasets drawn from the same database. There were also significant differences in image characteristics between databases.

Conclusions: The experimental results indicate that the model may be trained on differences in image characteristics between databases rather than on lesion features. This shows that evaluation metrics can be influenced by dataset organization, and that high metric values do not directly imply potential for clinical application. These findings emphasize the importance of suitable dataset organization for applying COVID‐19 diagnosis methods at real clinical sites. Radiologists, as the actual users of these methods, should understand this issue well.

In this urgent moment of rapidly growing cases, one of the most important ways to control the spread of COVID-19 is early and rapid diagnosis of the disease.
The most commonly used testing method is reverse transcription-polymerase chain reaction (RT-PCR). This method diagnoses COVID-19 by detecting the presence of specific nucleic acids corresponding to SARS-CoV-2. However, it has a low sensitivity of 60%-70% and takes a relatively long time to produce a result. [4][5][6] RT-PCR has been improved through the efforts of many researchers; [7][8][9] nevertheless, it continues to have limitations. Furthermore, RT-PCR test kits and reagents are in short supply relative to the many tests required, and the test is expensive. This situation is especially serious in countries with private health insurance systems or limited medical service systems.
Chest X-ray imaging can be an alternative method for COVID-19 diagnosis by detecting radiographic features including consolidation, ground glass opacities, and nodules. [10][11][12] It is highly accessible because X-ray equipment is among the most standard equipment in medical institutions and is relatively cheap compared with other methods. Furthermore, it can be used in isolation rooms because of its portability. 6,13 Lately, a number of studies have proposed deep learning-based diagnosis algorithms using chest X-ray images, 5,14-17 mostly focusing on performance improvement. To do so, they organized datasets from diverse public databases and evaluated the model performance. In those experimental results, this study noticed that the results differed according to the databases used, which could mean that the evaluation is influenced by dataset organization. For example, the classification accuracy was relatively high with compositions of specific databases. Therefore, this study concluded that the databases should be analyzed in order to organize a suitable dataset and verify the clinical effectiveness of a diagnosis model before it is applied in a real medical system.
In deep learning, the dataset is considered one of the important elements. For example, the size and class ratio of the dataset may affect the training process and bring positive or negative consequences. 18,19 The classification results of a model can also be affected by the characteristics of the database. Even if the model structure is the same, a different result may be produced depending on the composition of the image dataset.
The goal of this study is to analyze the impact of the image dataset on deep learning for COVID-19 diagnosis and to highlight the importance of image dataset organization for reliably verifying a method's potential for clinical application. To the best of our knowledge, this is the first study that carries out a deep analysis of diverse databases related to COVID-19. For this purpose, this study selected three chest X-ray databases that have been widely applied in previous studies. A deep learning model was then designed to classify the diseases on chest X-ray images. The classification accuracy was calculated on different compositions of the databases, and the results were compared.

2.A | Chest X-ray image datasets
The chest X-ray image datasets were obtained from three different public databases that were used in many previous studies. One database includes images of COVID-19 patients, and the other two consist of normal and pneumonia images. For convenience, this study refers to these databases as IEEE8023, NIH X-ray, and Chest X-ray2017. The details of each database were shown in

The NIH X-ray 22 database was provided by the National Institutes of Health and is also referred to as Chest X-ray8. It consists of frontal view X-ray images of normal and several disease states, including pneumonia, atelectasis, and cardiomegaly, acquired from 30 805 patients with an average age of 47. The images were labeled using natural language processing. Chest X-ray2017 was collected and labeled by Kermany et al. 23 It includes a total of 5232 chest X-ray images obtained from children. This study randomly selected 884 normal and 884 pneumonia images from each of the two databases. Both databases can be downloaded online. 24

Because COVID-19 is a new disease, there were not enough chest X-ray images in this study. Although the images are constantly being updated, they are still not enough to train deep learning models. To overcome the problem of an insufficient dataset, the classification models were pretrained on the large-scale ImageNet 32 database, and the acquired weight values were reused for training the model on the chest X-ray dataset.
The weight values were then fine-tuned to better suit COVID-19 classification.
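The random selection of 884 normal and 884 pneumonia images per database can be sketched as follows. This is a minimal illustration only: the function name `sample_balanced`, the fixed seed, and the file names are hypothetical and are not taken from the study.

```python
import random


def sample_balanced(paths_by_class, per_class, seed=0):
    """Randomly pick `per_class` items from every class, without replacement."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the draw repeatable
    selected = {}
    for label, paths in paths_by_class.items():
        if len(paths) < per_class:
            raise ValueError(f"class {label!r} has only {len(paths)} images")
        selected[label] = rng.sample(paths, per_class)
    return selected


# Hypothetical file lists standing in for one database.
database = {
    "normal": [f"normal_{i}.png" for i in range(1500)],
    "pneumonia": [f"pneumonia_{i}.png" for i in range(1200)],
}
subset = sample_balanced(database, per_class=884)
```

Drawing the same number of images per class keeps the class ratio equal across databases, so that differences in the results are less likely to come from class imbalance.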
To classify the X-ray images into normal, pneumonia, or COVID-19, the classifier was also designed based on convolution. In most previous models, a fully connected layer was used for the classification step. However, this structure converts the image data into one dimension, which can cause the model to ignore spatial features during training. Therefore, this study designed a convolution-based classifier to preserve the spatial structure of the image. This classifier involves three convolution layers. The first and second convolution layers consist of 3 × 3 convolution, batch normalization, activation, and dropout. In the last convolution layer, 1 × 1 convolution and global average pooling were used to set the number of outputs to the number of classes. Also, ELU 33 was used as the activation function to train the classification model efficiently without the vanishing gradient problem.

Fig. The framework of the COVID-19 diagnosis model.

each class. Therefore, the class with the highest probability was designated as the predicted label, and a binary value was then derived to calculate the metrics.
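The final stage of such a classifier, a 1 × 1 convolution followed by global average pooling and selection of the most probable class, can be sketched in plain Python. This is an illustration under assumptions rather than the study's implementation: the softmax step, the function names, and the toy weights are all hypothetical, and a real model would use a deep learning framework.

```python
import math

CLASSES = ["normal", "pneumonia", "COVID-19"]


def conv1x1(feature_map, weights, biases):
    """1 x 1 convolution: feature_map is [C_in][H][W], weights is [C_out][C_in]."""
    c_in = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [
        [[biases[o] + sum(weights[o][c] * feature_map[c][y][x] for c in range(c_in))
          for x in range(width)]
         for y in range(height)]
        for o in range(len(weights))
    ]


def global_average_pool(feature_map):
    """Average each channel over its spatial positions: [C][H][W] -> [C]."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def softmax(scores):
    """Turn scores into probabilities (assumed here; not stated in the text)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(feature_map, weights, biases):
    """Return class probabilities, the predicted label, and its binary vector."""
    probs = softmax(global_average_pool(conv1x1(feature_map, weights, biases)))
    best = max(range(len(probs)), key=probs.__getitem__)
    binary = [1 if i == best else 0 for i in range(len(probs))]
    return probs, CLASSES[best], binary


# Toy input: 2 channels of size 2 x 2, mapped to 3 class channels.
features = [[[1, 1], [1, 1]], [[0, 2], [2, 0]]]
probs, label, binary = predict(features, [[1, 0], [0, 1], [2, 1]], [0, 0, 0])
```

Because the 1 × 1 convolution produces one channel per class and pooling reduces each channel to a single value, the number of outputs equals the number of classes without flattening the feature map.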

2.C | Performance evaluation
This study also examined the distribution of the image datasets using principal component analysis (PCA) 34 and t-distributed stochastic neighbor embedding (t-SNE) 35 to analyze the characteristics of the X-ray images in each of the databases. This study identi-

3.A | The classification results for COVID-19 with normal or pneumonia on different data composition
First, the five models were trained to classify COVID-19 against the other classes. Table 3 shows the binary classification results for normal and COVID-19, and Table 4 shows the results for pneumonia and COVID-19. In each table, a higher metric value indicates higher performance. On NIH X-ray, the classification accuracy ranged from 0.88 to 0.97 for normal and COVID-19, and it was higher, above 0.98, on Chest X-ray2017 (Table 3). Pneumonia and COVID-19 were classified with average accuracies of 0.91 and 0.99 on the two databases, respectively (Table 4). For most models, the results were higher on Chest X-ray2017 than on NIH X-ray.
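Confusion matrix-based metrics for a binary task such as normal versus COVID-19 can be computed as in the sketch below. The exact set of metrics reported in the tables is not listed in this section, so the selection here (accuracy, sensitivity, specificity, precision) is an assumption, and the labels are toy values, not study data.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Guard against empty denominators (returns 0.0 in that case)."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred, positive):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),  # recall on the positive class
        "specificity": _ratio(tn, tn + fp),  # recall on the negative class
        "precision": _ratio(tp, tp + fp),
    }


# Toy labels for illustration only.
truth = ["covid"] * 3 + ["normal"] * 5
preds = ["covid", "covid", "normal", "normal", "normal", "normal", "normal", "covid"]
scores = binary_metrics(truth, preds, positive="covid")
```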

3.B | The classification results for normal and pneumonia on different data composition
To determine how the classification results depend on dataset composition, four scenarios were designed by combining NIH X-ray and Chest X-ray2017 for the classification of normal and pneumonia, as shown in Table 5. The experimental results demonstrated that when a different database was used for each class, the classification results were higher than when the same database was used for both classes. For example, for VGG19, the classification accuracy was 0.97 when Chest X-ray2017 was used for both normal and pneumonia, while it was 1.00 in the two scenarios with a different database for each class. Similar results were also found for ResNet50.
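The four scenarios follow from assigning one of the two databases to each class. A minimal sketch of that enumeration is given below; the dictionary keys and the `same_source` flag are illustrative names, not terms from the study.

```python
from itertools import product

DATABASES = ("NIH X-ray", "Chest X-ray2017")


def composition_scenarios():
    """Enumerate every assignment of a source database to each class."""
    return [
        {"normal": normal_db, "pneumonia": pneumonia_db, "same_source": normal_db == pneumonia_db}
        for normal_db, pneumonia_db in product(DATABASES, repeat=2)
    ]


scenarios = composition_scenarios()
mixed = [s for s in scenarios if not s["same_source"]]
```

Two of the four scenarios draw both classes from one database, and the other two draw each class from a different database; the latter are the compositions in which the class label and the image source coincide.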

Chest X-ray2017 were applied for normal and pneumonia datasets.
When the model was trained on NIH X-ray, Chest X-ray2017 was used as the test dataset. The opposite procedure was also performed. Table 6 presents the classification results of this cross-training evaluation. In Table 6, N and P in the class column indicate normal and pneumonia, respectively. The overall accuracies were under 0.75. Moreover, in some cases, the specificity was not even over 0.1. These values are significantly lower than when the models were trained and tested with the same kind of database, as listed in Tables 1 and 2.
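The cross-training protocol (train on one database, test on the other, then swap) can be sketched as follows. Everything in this example is hypothetical: a single scalar feature per image, a threshold "model", and two toy databases whose values differ by a constant offset. It only illustrates how a source-dependent shift can collapse cross-database accuracy.

```python
def train_threshold(samples):
    """Toy model: midpoint between the class means of a scalar feature."""
    normal = [x for x, y in samples if y == "normal"]
    pneumonia = [x for x, y in samples if y == "pneumonia"]
    return (sum(normal) / len(normal) + sum(pneumonia) / len(pneumonia)) / 2


def accuracy(threshold, samples):
    """Predict pneumonia above the threshold and score against the labels."""
    correct = sum(("pneumonia" if x > threshold else "normal") == y for x, y in samples)
    return correct / len(samples)


def cross_database_evaluation(databases, train_fn, evaluate_fn):
    """Train on each database and evaluate on every other database."""
    results = {}
    for train_name, train_data in databases.items():
        model = train_fn(train_data)
        for test_name, test_data in databases.items():
            if test_name != train_name:
                results[(train_name, test_name)] = evaluate_fn(model, test_data)
    return results


# Hypothetical databases: B has the same structure as A but shifted by +10.
database_a = [(1.0, "normal"), (2.0, "normal"), (5.0, "pneumonia"), (6.0, "pneumonia")]
database_b = [(x + 10.0, y) for x, y in database_a]
results = cross_database_evaluation(
    {"database_A": database_a, "database_B": database_b}, train_threshold, accuracy
)
```

In this toy setting, the model trained on database A labels every image of database B as pneumonia, so all normal images are misclassified even though the two databases are perfectly separable on their own.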

3.D | The visualization results of dataset
To analyze the image characteristics of each database, the dimensionality of the images was reduced to two, and the results were plotted as two-dimensional graphs. Fig. 2 shows the result acquired using PCA, and Fig. 3 shows the result for t-SNE. The datasets were from the IEEE8023, NIH X-ray, and Chest X-ray2017 databases. Each of the NIH X-ray and Chest X-ray2017 databases was divided into two types according to class (normal and pneumonia). In Figs

19. This occurred because COVID-19 differs more distinctly from normal than from pneumonia on chest X-ray images. COVID-19 images possess lesions such as ground glass opacity and consolidation, and these can also be captured on pneumonia images. 5 Therefore, differentiating COVID-19 from pneumonia may be harder than differentiating it from normal. Secondly, the classification results differed distinctly according to the dataset type. Overall, the models classified Chest X-ray2017 with higher accuracy than NIH X-ray. This demonstrates that the classification performance can be affected by the dataset organization.
TABLE 5 The classification results for normal and pneumonia on different data compositions.

This study also compared the classification results for normal and pneumonia on the different dataset compositions. The classification results were relatively higher when a different database was used for each class than when the dataset was organized from a single database, as shown in Table 5. This implies that the model may be trained on the differing characteristics of NIH X-ray and Chest X-ray2017.
In other words, when these two databases are used as the normal and pneumonia datasets, respectively, the model may be trained to distinguish the databases rather than the pathological features. This suggests that the model was perhaps not well trained, even though it shows high performance on the evaluation metrics.
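One simple way to check for this risk before training is to ask how well the class could be predicted from the source database alone. The sketch below is a sanity check proposed here for illustration, not a procedure used in the study; the function name and the toy sample lists are assumptions.

```python
from collections import Counter, defaultdict


def source_only_accuracy(samples):
    """Accuracy of guessing each image's class from its source database only.

    samples: list of (source_database, label) pairs.
    """
    labels_by_source = defaultdict(Counter)
    for source, label in samples:
        labels_by_source[source][label] += 1
    # Best possible guess per source is that source's most frequent label.
    correct = sum(counts.most_common(1)[0][1] for counts in labels_by_source.values())
    return correct / len(samples)


# Toy compositions: one database per class versus a single shared database.
mixed_sources = [("NIH X-ray", "normal")] * 4 + [("Chest X-ray2017", "pneumonia")] * 4
single_source = [("NIH X-ray", "normal")] * 4 + [("NIH X-ray", "pneumonia")] * 4

mixed_score = source_only_accuracy(mixed_sources)
single_score = source_only_accuracy(single_source)
```

A value of 1.0 means the source database fully determines the class, so a model can reach perfect accuracy without using any lesion feature; a value near the majority-class rate means the source carries no such shortcut.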
For further analysis, cross-training was performed by using different databases for training and testing, as shown in Table 6. The results were worse than when the same type of database was used for training and testing (Tables 3 and 4). They show that the model trained on NIH X-ray performs poorly on Chest X-ray2017, and vice versa. These results demonstrate the low robustness of the model and suggest that it was trained to focus on the differing image characteristics between databases.
This study also analyzed the image characteristics of the databases by representing them as two-dimensional graphs using PCA and t-SNE. Among the databases applied in this study, Chest X-ray2017 was originally put together for developing a pneumonia diagnosis system for children under 5 years old. 23 It contains a large number of X-ray images and provides detailed information.
However, this high-quality database may not accurately fit the COVID-19 classification problem, because COVID-19 classification is not only for children. The IEEE8023 database consists of X-ray images of patients of various ages, from infants to the elderly. Also, the X-ray images of each database were acquired in different environments and with different parameters. These factors are considered to have led to the differences in image characteristics between databases, and the model may have been trained on these differences rather than on the lesion features. Furthermore, NIH X-ray was labeled by natural language processing rather than manually. 22 The labeling accuracy is only about 90%, and this may affect the experimental results of the X-ray image classification.
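The two-dimensional PCA projection used to compare databases can be sketched with the standard library alone. This is an illustrative sketch under assumptions: each image is treated as a short flattened feature vector, the principal directions are found by power iteration with deflation, and the two toy "databases" differ only by an offset in the first feature. t-SNE is not reproduced here, and a numerical library would normally be used instead.

```python
import math


def center(rows):
    """Subtract the per-feature mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(matrix, iterations=500):
    """Dominant eigenvector and eigenvalue by power iteration."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigenvalue


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    v1, lam1 = top_component(cov)
    d = len(cov)
    # Deflation: remove the first component, then find the next one.
    deflated = [[cov[i][j] - lam1 * v1[i] * v1[j] for j in range(d)] for i in range(d)]
    v2, _ = top_component(deflated)
    return [(sum(a * b for a, b in zip(r, v1)), sum(a * b for a, b in zip(r, v2)))
            for r in centered]


# Hypothetical feature vectors: database A is offset from B in the first feature.
db_a = [[10, 1, 0], [11, 2, 1], [10, 2, 0], [11, 1, 1]]
db_b = [[0, 1, 0], [1, 2, 1], [0, 2, 0], [1, 1, 1]]
points = pca_2d(db_a + db_b)
```

When the largest source of variance is the database offset rather than the class, the first principal component separates the images by database, which is the kind of pattern the two-dimensional graphs are meant to reveal.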
COVID-19 has brought many changes. It has become very difficult to move across borders, and the disease has severely affected economies and daily life in many countries. There have been many attempts to diagnose COVID-19 using chest X-ray images, and this can be an innovative solution for rapid and cheap diagnosis of COVID-19. However, the clinical effectiveness of these methods still needs to be verified before they can be applied at real clinical sites. And for this, more efforts are

CONFLICT OF INTEREST
No conflicts of interest.

DATA AVAILABILITY STATEMENT
These data were derived from the following resources available in the public domain: https://github.com/ieee8023/covid-chestxray-dataset, https://nihcc.app.box.com/v/ChestXray-NIHCC, and https://data.mendeley.com/datasets/rscbjbr9sj/2.