WG 2 AN: Synthetic wound image generation using generative adversarial network

In part due to their ability to mimic any data distribution, Generative Adversarial Network (GAN) algorithms have been successfully applied to many applications, such as data augmentation, text-to-image translation, image-to-image translation, and image inpainting. Learning from data without crafting loss functions for each application gives the GAN algorithm broad applicability. Medical image synthesis is another field in which the GAN algorithm has great potential to assist clinician training. This paper proposes a synthetic wound image generation model based on a GAN architecture to increase the quality of clinical training. The proposed model is trained on chronic wound datasets of various sizes taken from real hospital environments. Hyperparameters such as epoch count and dataset size are also studied to find the optimum training conditions. The performance of the developed model was evaluated through the mean squared error (MSE) metric to determine the similarity between generated and actual wounds. Visual inspection is performed to examine the generated wound images. The results show that the proposed synthetic wound image generation (WG 2 AN) model has great potential to be used in medical training and performs well in producing synthetic wound images with a 1000-image training dataset and 200 epochs of training.


INTRODUCTION
Computer-human interaction has an essential impact on facilitating knowledge generation, dissemination, access, and utilization [1]. The study of this interaction has evolved, in broader terms, into the active research domains of machine learning (ML) and Artificial Intelligence (AI) over the past 60 years [2].
A key element of successful deep learning is the availability of massive amounts of data [3]. Deep learning (DL) applications in healthcare are found to be lagging, in large part, due to the high cost of collecting accurate, labelled data. The total cost of data management in healthcare can rise to millions of dollars [4], which affects medical training as well. Medical students' education costs are high, with many instances where there is not enough data to showcase the pathologies, wounds, and diseases. In addition, class imbalances hinder the performance of AI models, such as decision trees (DT), neural networks (NN), and support vector machines (SVM) [5]. Limited generalization ability and overfitting are common when neural networks with millions of parameters are trained without reliable training datasets [6]. One could say that data augmentation techniques could be used to prevent overfitting and mitigate imbalanced class problems [7]. However, augmented data still intuitively resembles the original images [8]. This is why the quantity of training data is crucial to train and validate a model utilizing machine learning techniques [9].
Determining optimal hyperparameters is critical to AI models. However, the process is largely considered a black art [10]. The training dataset and the number of epochs play an important role when building data-hungry deep learning models. The number of epochs to train a model can be determined through loss functions in traditional machine learning models, whereas the GAN algorithm involves complicated loss functions. Hence, finding the optimum number of epochs is also a critical task that requires labour, time, and expertise in the healthcare field [11,12]. In the scope of this study, the performance of the synthetic wound generation was examined extensively with regard to the number of training epochs and the training dataset size. Six different wound datasets were created to measure the effect of dataset size, and the model was trained for 2000 epochs. Results at every fifth epoch up to the 2000th epoch were compared to analyse the effect of epoch count in detail.
In the literature, there are limited studies using the GAN algorithm in the healthcare domain. A conditional Wasserstein GAN framework was introduced for Electroencephalography (EEG) data augmentation to improve emotion recognition [13]. Magnetic Resonance (MR) image generation using Wasserstein GAN was presented for data augmentation and physician training [14]. In the study by Antoniou et al. [15], a data augmentation GAN model was developed to provide broader augmentation than standard techniques such as mirroring and rotation. Synthetic data was generated through GAN models to augment the dataset with synthetic images to maximize the performance of the classifier [16] and to improve the classification task on blood cells [17]. The authors in [18] propose balancing GAN (BAGAN) to balance class distribution, training the GAN with all available images and generating synthetic underrepresented-class images utilizing conditioning in the latent space. A study by Wang et al. introduces the generation of underrepresented-class plankton images by training the GAN with underrepresented-class images, and classifies plankton types using convolutional neural network (CNN) layers shared with the discriminator network [19]. The authors in [20] propose a training scheme that implements classical data augmentation techniques to enlarge a Computed Tomography (CT) liver image dataset and then enlarges the dataset a second time through synthetic image generation with a GAN trained on the classically augmented data. The study in [21] investigates different GAN architectures, such as Super Resolution GAN (SRGAN) and DualGAN, to generate CT images from real Magnetic Resonance Imaging (MRI). The authors in [22] propose an autoencoder-combined generative adversarial network to synthesize jellyfish images using a smaller dataset than other GANs.
The study in [23] investigates the generation of synthetic fundus images of age-related macular degeneration (AMD), which are indistinguishable from real ones using progressively grown generative adversarial networks (ProGANs).
The aim of this paper is to utilize a GAN architecture to create a novel model that generates synthetic wound images, which has not been done previously. This can help medical students and clinicians train to predict wound type and wound stage more accurately. This study also presents a criterion for the optimum number of training images and epochs for generating wound images using GAN algorithms.
The paper is organized into five sections. Section 1 introduces the proposed algorithm details and states the problem. Section 2 presents the methodology employed. The workflow for data collection, dataset preparation for processing, environment, and validation steps are presented in Section 3. Experiments using the various training datasets and epoch counts are presented next, followed by the discussion and conclusion. The main contributions of this study are as follows:
• Development of a model using the GAN algorithm in a conditional setting to generate synthetic wound images.
• Use of the Mean Squared Error (MSE) metric to compare similarity between generated and actual wound images.
• Validation and evaluation of model performance visually.
• Develop a hyperparameter selection guideline that can be utilized by healthcare researchers while training medical AI applications.

METHODOLOGY
GAN has the ability to mimic any data distribution with the help of adversarial generator and discriminator networks. The minimax optimization between the adversarial generator (G) and discriminator (D) networks is at the heart of the GAN architecture. Both networks are trained simultaneously with real data to learn the data distribution evenly. The generator network takes uniform or Gaussian noise (z) as an input and produces fake images (y), which are fed to the discriminator network to be classified as real or fake, D: {y} → [0,1]. The generator and the discriminator networks are trained together on the training set. Following this, the generator network is used to generate fake images that are not differentiable from the training set. Figure 1 shows the basic structure of the GAN model. There are various versions of the GAN algorithm for different applications. For our study, Vanilla GAN, DCGAN, CycleGAN, and cGAN based versions were examined. The Vanilla GAN and DCGAN based models for synthetic wound generation were not suitable for our study, as the generated wound images were not compatible with the tissue segmentations. On the other hand, the CycleGAN based model was promising due to its ability to perform unpaired image-to-image translation. The initial trials with the CycleGAN model revealed that the generated synthetic wound images suffer from the mode collapse problem, a well-known complication in the GAN field, which causes the synthesized wound images to look similar despite different tissue segmentation inputs [24]. As these models failed to generate proper wound images coherent with the given wound tissue segmentation images, the Vanilla, DCGAN, and CycleGAN based models were discarded from this study.
A cGAN based architecture is used to condition the output (generated image) on a specific data distribution, with an input image (x) and the addition of a noise (z) factor. To condition the output, the input image is also given to the generator. The input image is the segmented ground truth gathered from eKare LLC. The discriminator is likewise fed with the input segmented image and the real data to distinguish whether the generator performed meaningful image synthesis or not.
The generator network learns a mapping from the input domain to the output domain, G: {x, z} → y. The discriminator network learns a loss function to train this mapping [25] and tries to differentiate the fake image (y) from the real one, D: {x, y} → [0,1]. This architecture is used to generate outputs that are similar to the input. It takes paired and aligned images as input to generate look-alike images. Both networks, the discriminator and the generator, observe the input. Figure 2 shows the general architecture of the model, which is fed with noise and input data. The loss coming from the discriminator is fed back to the generator.
The discriminator network has two losses, namely D_real and D_fake, which indicate the ability of the discriminator network to differentiate the real and the fake images. Another loss, used for the generator, is G_L1, which compares the generated fake image with the real image to encourage more plausible fake images. The adversarial loss (L_Adv) coming from the discriminator network, together with the L1 loss (L_L1) of the generator network (G_L1), results in a complicated objective function for the generator network. This is why the process of training a GAN architecture is hard to interpret.
The L1 loss behaves as a regularizing term with a hyperparameter, lambda (λ), which is chosen as 100. The final objective function is an analogy of a minimax game in which the generator tries to minimize, and the discriminator tries to maximize, an adversarial objective. This relationship is depicted in (1) and (2):

$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]$ (1)

$G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G)$ (2)

The hyperparameter problem was examined through various experiments. To measure the effect of dataset size, six (6) training datasets were created. For each of these datasets, a model was trained for 2000 epochs. The performance of the model was measured every five epochs through the MSE metric for extensive analyses of the hyperparameter selection. The significance of the hyperparameters is examined via visual inspection as well.
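As an illustration of the combined objective in (2), the generator's training signal can be sketched in NumPy. This is a simplified, non-authoritative sketch, not the paper's implementation: it uses the non-saturating form of the adversarial term and assumes the discriminator outputs have already passed through a sigmoid.

```python
import numpy as np

LAMBDA = 100.0  # L1 regularization weight, as chosen in this study


def generator_loss(d_fake_probs, fake_img, real_img, lam=LAMBDA):
    """Combined generator objective: adversarial term plus lambda * L1.

    d_fake_probs: discriminator sigmoid outputs in (0, 1) for generated images.
    fake_img, real_img: arrays of pixel values in [0, 1].
    """
    # Non-saturating adversarial term: the generator wants D(x, G(x, z)) -> 1
    adv = -np.mean(np.log(d_fake_probs + 1e-12))
    # L1 term pulls the generated image toward the paired ground truth
    l1 = np.mean(np.abs(fake_img - real_img))
    return adv + lam * l1
```

Because λ = 100, the L1 term dominates early training and pushes the output toward the paired ground truth, while the adversarial term supplies the texture detail that plain L1 regression would blur away.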

DATA COLLECTION, PRE-PROCESSING, ENVIRONMENT, AND VALIDATION
This section discusses data collection, data pre-processing, the simulation environment, and the validation methods.

Data collection
Chronic wound images were provided by eKare, Inc (Fairfax, VA). Chronic wounds of various types, that is diabetic, burn, lymphovascular, pressure injury, and surgical, are anonymized and semi-automatically labelled for training and testing [26]. The diversity of chronic wound types increases the applicability of the model. This dataset includes around 4100 wound images ranging in size from 1224 × 1224 to 2160 × 2160 pixels (depending on the camera used for acquisition). Wound images were acquired under normal ambient room lighting by clinical users with commercially available 3D wound cameras.

Pre-processing
Chronic wound images are rescaled to 512 × 512, with the original wound and its labelled pair concatenated to form a 512 × 1024 pixel image. Before concatenation, the background of the labelled image is cleaned and made white. One hundred wound images out of the 4100 are chosen for the test dataset. The training dataset constitutes the rest of the dataset. To study the effectiveness of the dataset size, training datasets with sample sizes of 100, 250, 500, 1000, 2000, and 4000 were created. Normalization and formatting were done to ensure high performance during training.
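The pre-processing steps above can be sketched as follows. This is a minimal NumPy version; the nearest-neighbour resize and the rule that all-zero pixels count as background are simplifying assumptions, not the authors' exact pipeline.

```python
import numpy as np


def nearest_resize(img, size=512):
    """Nearest-neighbour rescale of an H x W x 3 image to size x size."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]


def make_training_pair(wound, label, white=255):
    """Rescale both images to 512 x 512, whiten the label background
    (here assumed to be pixels that are zero in all channels), and
    concatenate side by side into one 512 x 1024 training pair."""
    wound = nearest_resize(wound)
    label = nearest_resize(label).copy()
    background = (label == 0).all(axis=-1)
    label[background] = white
    return np.concatenate([wound, label], axis=1)  # 512 x 1024 x 3
```

The concatenated layout keeps each wound photo and its segmentation spatially aligned in one file, which is the usual input format for paired image-to-image translation models.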

Environment
The proposed synthetic wound image generation model is trained using the Adam optimizer with a learning rate of 0.0002 for the first half of the training, while the rest is done with a learning rate decaying linearly to zero. The Adam optimizer is used as it converges faster and generates lower loss values for the generator network in comparison to other optimizers in GAN applications [27]. The learning rate setting is chosen in line with previous works on image synthesis using GAN algorithms [28].
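The schedule described above (a constant 0.0002 for the first half of training, then linear decay to zero) can be sketched as a plain function; in a framework such as PyTorch this would typically be wrapped in a learning-rate scheduler. The function below is an illustrative sketch, not the authors' code.

```python
def learning_rate(epoch, total_epochs=2000, base_lr=2e-4):
    """Learning-rate schedule: constant base_lr for the first half of
    training, then decaying linearly to zero over the second half."""
    half = total_epochs // 2
    if epoch < half:
        return base_lr
    # Linear decay from base_lr at epoch `half` down to 0 at `total_epochs`
    return base_lr * (total_epochs - epoch) / (total_epochs - half)
```

Decaying the rate only in the second half lets the model take large steps while the loss is still dropping quickly, then anneal gently as the generator and discriminator approach their equilibrium.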

Validation
Validation of the proposed method is done with the mean squared error (MSE) metric, which is a pixel-wise error measurement. Three colour channels (RGB) are considered while calculating the MSE scores, where $Y_R$, $Y_G$, $Y_B$ denote the colour channels of real images, $Y'_R$, $Y'_G$, $Y'_B$ denote the colour channels of generated images, and n denotes the number of pixels:

$\mathrm{MSE} = \frac{1}{3n}\sum_{i=1}^{n}\left[(Y_{R,i} - Y'_{R,i})^2 + (Y_{G,i} - Y'_{G,i})^2 + (Y_{B,i} - Y'_{B,i})^2\right]$
The MSE metric is applied to each pixel in the fake and real images. Each pixel has three channels, namely red, green, and blue. Each corresponding pixel pair in the fake and real images is compared, and the MSE score is calculated. The MSE metric provides reliable quality scores [29]. Converging and meaningful results for the synthetic wound generation task are obtained using the MSE metric. The synthetic wound results are expected to be similar to actual wound images, a similarity the MSE metric can evaluate properly.
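As a minimal sketch, the per-image metric described above can be written in NumPy. Averaging over all n pixels and the three channels together (i.e., dividing by 3n) is an assumption about the exact normalisation; a different constant factor would not change the ranking of the models.

```python
import numpy as np


def rgb_mse(real, fake):
    """Pixel-wise MSE over the three colour channels, averaged over
    all pixels, matching the validation metric described above."""
    real = np.asarray(real, dtype=np.float64)
    fake = np.asarray(fake, dtype=np.float64)
    # Mean of squared differences over every pixel and channel (3n values)
    return float(np.mean((real - fake) ** 2))
```

Computing the score per image and then taking a median across the test set, as done later in the paper, makes the summary robust to a few badly generated outliers.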

EXPERIMENTS
Model and loss function outputs, the effect of dataset size, and epoch count on synthetic wound generation are discussed with MSE score and visual inspection in this section. Additionally, the optimum training conditions are determined with hyperparameter tuning.

Model output
The proposed model takes a wound tissue segmentation as input and produces synthetic wound images. Wound segmentation is done with respect to wound tissue types, namely necrotic (blue colour), granulation (red colour), and slough (yellow colour). This classification is of foremost importance to the caregiver while assessing a wound and deciding on a treatment plan. A successful result from the WG 2 AN model is depicted in Figure 3, which indicates proper synthetic wound generation by training with 4000 images and 2000 epochs. The actual wound, Figure 3B, and the synthetic wound image, Figure 3C, are presented without background for a better comparison with the segmentation input, Figure 3A.
In addition to the original wound image, a combination of the synthetic wound and original wound background is shown in Figure 4A-B. The generated image has characteristics of a real wound.

Effect of number of images and epochs on model loss
The chronic wound dataset is pre-processed and arranged into six (6) subgroups, which have 100, 250, 500, 1000, 2000, and 4000 images. The proposed model with the WG 2 AN architecture to synthesize wound images is trained for 2000 epochs. The effects of dataset size and epoch count are examined to find the optimum training dataset.
The G_L1 loss curves of the model with different training dataset sizes are depicted in Figure 5. The adversarial and discriminator network losses, that is D_fake and D_real, do not provide useful information, and they are not included in Figure 5. The proposed model could not settle down during the first 75 epochs, as the losses increase, which is a common problem caused by the randomized initial weights of the neural networks. The loss curves also share a similar fluctuating but decreasing pattern. The general zigzag behaviour of the loss curves is a result of the alternating wound image samples during training. The proposed model moves along the different training samples without reaching an equilibrium, which causes this oscillation [30].
The loss curve of the model with 100 images has the lowest loss. The model with 250 images has the second-lowest loss values. The other curves follow this characteristic as well. It can be deduced that models with fewer training images produce lower loss values. The loss graphs are compared to evaluate the learning ability of the models. The 4000-image model was trained for 4000 epochs; however, its loss does not stabilize below 15 and oscillates around it. The drop rate of the G_L1 loss indicates that our proposed model learns the data distribution faster with a smaller dataset. The models with a larger training set have more complicated data distributions that take more epochs to mimic.
As stated in (2), the generator and the discriminator networks compete in a min-max game, which results in a non-converging problem [25]. Although the G_L1 loss has a meaningful curve that could be used to determine whether the model is learning the training data distribution, it lacks the contribution of the adversarial loss. That is why an additional evaluation metric, the MSE, is needed. The MSE score provides a better understanding of whether the synthetic wound images generated by the proposed WG 2 AN model look like real wounds or not.

Effect of number of images and epochs on synthetic wound generation
The effect of the training parameters, that is dataset size and epoch count, is discussed in this section. The models with different training dataset sizes (100, 250, 500, 1000, 2000, and 4000) are evaluated visually and using the MSE score. Figure 6 shows the actual wound and the segmented wound, which is the input of the model. The outputs of the models at four different epoch counts (200, 500, 1000, and 2000) are compared in Figures 7-12. The model with the 4000-image dataset was trained for 4000 epochs, but the effect of the additional training was negligible; hence, the comparison is cut at the 2000th epoch for each model. The images generated before the 75th epoch generally lack the details and texture of a wound and suffer significantly from the checkerboard effect. The up-sampling layers in the generation pipeline of a GAN model, which produce high-resolution images from low-resolution ones, cause checkerboard artefacts [31]. The checkerboard pattern emerges when deconvolution has uneven overlap [32].
The models with smaller training datasets tend to be biased towards a certain data distribution, which is the case for the 250- and 500-image models, which produced darker wounds. With further training, this dataset limitation was overcome for the 500-image model at the 2000th epoch, as seen in Figure 9D. Training the models longer provides detailed texture and balance in the tissue distribution (also visible in Figure 9D).
The 100-image model generated insufficient wounds at 500 epochs and lower epoch counts. However, increasing the training to 1000 epochs improved the generated images significantly, as seen in Figure 7C and D. The 250-image model generated primitive wounds at 200 epochs in Figure 8A. Generated wounds have more balanced wound tissue characteristics, with some limitations such as overall darker colours, at 500, 1000, and 2000 epochs in Figure 8B-D. Two hundred epochs of training are not enough for life-like wound image generation, as seen in Figure 8A.
The model with 500 images produced more life-like wounds at higher epoch counts, as shown in Figure 9C and D. The result of the 500-image model at 200 epochs in Figure 9A does not represent a real wound.
The model with 1000 images exhibits better performance than the previous models. At 200 epochs of training in Figure 10A, it shows a reasonable output, where the previous models produce only an elementary output from the fed segmentation data. The results of 500, 1000, and 2000 epochs of training produce better wound images, as seen in Figure 10B-D. The results of the 2000-image model in Figure 11A also form a wound with close-to-real wound tissue characteristics.
The model with 4000 images generates life-like wounds from 200 to 2000 epochs in Figure 12A-D. Every generated sample carries the characteristics of a real wound and a well-balanced tissue distribution.
By varying the number of epochs and the dataset size, the generated image quality improved significantly. It can be inferred that a lack of data can be compensated for by further training of the model.

Evaluation of the model with MSE metric
The model performance is evaluated by the MSE metric, as summarized in Table 1, which indicates the MSE scores of the models with different training dataset sizes and epoch counts. A comparison of the MSE scores can be examined in Figure 13. MSE scores are calculated on individual images, after which the median is taken. A comparison of the MSE scores provides guidance on the generated image quality; a lower MSE score means better generated image quality. The overall trend of the MSE scores in Table 1 is decreasing with expanding dataset size and further training of the models. The first 75-100 epochs of training are necessary for the model to settle; the models' MSE scores at 100 epochs are therefore also included in Table 1. A slight deviation from this trend is visible between Figures 10C and 10D: the slough area gets smaller in the former, which causes a slight MSE increase. Models with a higher number of training images, that is 2000 and 4000, follow a similar pattern to the 1000-image model.
The MSE score at the 5th epoch is negligible, as the generated images are not good enough to mimic wound texture and detail. Training the model with a smaller dataset for a longer number of epochs results in overfitting, that is, an increase in MSE score. This can be seen in Figure 13, especially after the 500th and 200th epochs. It is confirmed that increasing the dataset size and the epoch count up to the 500th epoch gives a better MSE score. For the first 200 epochs of training, the decrease in MSE score is so significant that further training has only a minimal effect on the score. Therefore, 200 epochs is the optimum epoch count for training the proposed method. The dataset size needed to generate plausible images is 1000, because the 1000-image model generated the most life-like wounds, as seen in Figure 10A. Hence, we conclude that 1000 images and 200 epochs of training are the optimum training hyperparameters for plausible wound image generation.
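The selection logic above, choosing the cheapest configuration whose median MSE is close to the best observed score, can be sketched as a small helper. The tolerance rule and the score table in the usage below are illustrative assumptions, not values reported in the paper.

```python
def pick_optimum(scores, tolerance=0.05):
    """Choose the cheapest (n_images, n_epochs) configuration whose
    median MSE is within `tolerance` (relative) of the best score.

    scores: dict mapping (n_images, n_epochs) -> median MSE.
    """
    best = min(scores.values())
    # Keep every configuration whose score is close enough to the best
    ok = [cfg for cfg, s in scores.items() if s <= best * (1 + tolerance)]
    # Prefer fewer training images first, then fewer epochs
    return min(ok, key=lambda cfg: (cfg[0], cfg[1]))
```

With hypothetical scores such as `{(100, 2000): 0.90, (1000, 200): 0.52, (4000, 2000): 0.50}`, the helper would prefer the 1000-image, 200-epoch configuration, mirroring the diminishing-returns argument made in this section.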

DISCUSSION
The proposed method generates synthetic wound images from an input segmented tissue outline, as shown in Figure 2.
The output is compared with the ground truth wound images. Two views of a successful result from the model are presented in Figures 3 and 4, indicating optimal synthetic wound generation from the segmentation of a wound by training with 4000 images and 2000 epochs. The generated wound image is combined with the rest of the limb. The generated wound tissue has a life-like structure, proper colour, and detailed texture. The colours of the wound tissues are well-matched and conformable. Figure 5 shows the L1 loss graphs of the models with the different training datasets. The loss curves share a similar decreasing pattern. These extensive investigations suggest that the optimal performance for synthetic wound generation could be achieved using 1000 images and 200 epochs of training.
On the study results, the following observations regarding the application of the WG 2 AN model could be made.
• Observation 1: The proposed model can perform synthetic wound generation from a provided wound segmentation.
• Observation 2: WG 2 AN has a high potential of producing close-to-real synthetic images.
• Observation 3: The quality of the generated images scales with the image count. According to our results, a 1000-image dataset is the threshold for valid generated images.
• Observation 4: The epoch count has a significant impact on the generated image quality. Yet, after surpassing a 200-epoch threshold, the model reaches its convergence, and additional training has a marginal effect on the quality of the generated image.
• Observation 5: The WG 2 AN model can generate detailed tissue texture.
• Observation 6: Insufficient training can be compensated for by increasing the training dataset.
• Observation 7: Scarcity of training images can be mitigated to some extent by further training of the model.
• Observation 8: Generated wound images can be combined with any part of the body to demonstrate the wound characteristics at that body location.

CONCLUSION
This paper presents synthetic wound image generation using the proposed WG 2 AN architecture. Synthetic wound generation using GAN is implemented for the first time in the literature. The wounds are segmented with respect to tissue types using semi-automatic machine learning techniques. The proposed model is then fed with segmented wound images to generate synthetic wounds. Given patient-privacy constraints and the lack of large wound image datasets in healthcare, the generation of synthetic, anonymous wound images could enable further studies in AI and improve the adoption of AI in clinician training. The L1 loss is also examined in order to understand the impact of training dataset size and epoch count, where the loss curve does not reveal much information other than the models' learning behaviour. The generated images are examined and compared visually to evaluate their resemblance to a real wound. It can be concluded that the hardship of finding adequate training images in healthcare can be mitigated by additional model training. The effect of dataset size is also evaluated visually for further analyses. An increase in the training dataset brings in more life-like wound images.
The results of different dataset sizes and epoch counts are evaluated through the MSE metric to compare the generated images. The generated images are expected to be very similar to actual wound images, which is why the MSE metric gives a sufficient guideline when comparing actual and synthetic images. The experiments confirm that the 1000-image model with 200 epochs of training yields optimum results. Increasing these parameters could provide better synthetic images.
As future work, the segmentation of the wound could be done in more detail. In addition, the use of the proposed model in an education package could bolster the performance and practice of medical training.
Balancing underrepresented classes with synthetic image generation could help with the adoption of AI in the healthcare industry, where sourcing context-specific data is expensive. It is expected that this study can provide a handy clinician tool for generating and interacting with live wound models.