Automated renal segmentation in healthy and chronic kidney disease subjects using a convolutional neural network

Total kidney volume (TKV) is an important measure in renal disease detection and monitoring. We developed a fully automated method to segment the kidneys from T2‐weighted MRI to calculate TKV of healthy control (HC) and chronic kidney disease (CKD) patients.


| INTRODUCTION
Segmentation of the kidneys from MRI is a time-consuming aspect of many renal MRI studies. [1][2][3] Total kidney volume (TKV) gives insight into renal function and is therefore used as a measured parameter for a variety of renal pathologies. The use of TKV is an active area of ongoing research for autosomal dominant polycystic kidney disease (ADPKD), which is characterized by an increase in TKV as a result of cyst formation. Disease progression can be monitored by recording TKV, with higher rates of TKV increase being associated with a more rapid decrease in renal function. [4][5][6] Measurements of TKV in chronic kidney disease (CKD) subjects have shown a significant correlation with glomerular filtration rate, 7 the primary measure of CKD severity, 8 and, more generally, a decrease in TKV is associated with a decrease in renal function. 9 When studying pathologies that commonly lead to a change in kidney function, total kidney perfusion is often measured. This metric relies on an accurate measurement of the renal blood flow and volume of each kidney, and allows investigators to ascertain whether blood flow is preserved as the organ changes in size or whether tissue perfusion is impaired. In addition to TKV measurements, renal segmentation is an important first step in many other processing pipelines, for example, increasing the accuracy of cortical-medullary segmentations or reducing computation times by only fitting quantitative maps for voxels within the kidney.
The gold standards of kidney segmentation are manual region-of-interest (ROI) boundary tracing 10 or stereology 11 by experienced and skilled experts, with blood vessels in the kidney and the hilum excluded. These manual processes are highly time-consuming (taking ~15-30 min per subject) [12][13][14] and can be biased by investigator judgement because of the similar signal intensities between the kidneys and surrounding organs, anatomical differences between subjects, cysts, and image artefacts. Consequently, the resulting kidney ROIs are subject to intra- and inter-expert variability as a result of the varying expertise levels; an expert may segment a specific image differently when it is segmented more than once, and different experts may segment the same image differently. These factors mean that the development of a faster and ideally fully automated method of renal segmentation is highly desirable. However, the same factors that make manual segmentation difficult can also limit fully automated methods, for example, the signal intensity of the kidneys closely matches that of other abdominal structures such as the spleen. A number of automated methods have been proposed with varied success. 12 Some simply assume the kidney is an ellipse and calculate the volume from measurements of the pole-to-pole distance 15,16 or include a correction factor to reduce overestimations. 17 Unfortunately, these techniques produce a large confidence interval and still require human intervention to define the pole-to-pole length, a process that can produce inconsistencies between readers and takes a reasonable amount of time (~5 min). 18 Other semi-automated methods use classical image processing techniques such as thresholding, 19 watershed, 20 level sets, 14,21 and spatial prior probability mapping. 22 These methods can be inaccurate, over-segmenting the kidneys, or they include a number of parameters that need to be manually adjusted and are computationally intensive.
Further, the fact that each technique is highly optimized for a specific data set means that it needs to be re-written to be applied to a different pathology, which is another time-consuming and highly skilled process.
Machine learning methods have the potential to automatically detect different patterns from data given to a model that has been trained. Deep learning is a class of machine learning algorithms that can model high-level information in an image using several processing layers of transformations. This uses an architecture of multi-level linear and non-linear operations, described by layers, to learn complex functions that can represent high-level detail to map the input data to the output segmentations directly. As more data becomes available the algorithm can become more accurate and generalized, without a need to rewrite the underlying methods, therefore making it a good choice for long-term development.
In recent years, deep learning-based methods have been applied to the segmentation of medical images, with the U-Net being especially successful. 23 This modified fully convolutional neural network (CNN) architecture uses a number of convolution, pooling, and upsampling layers to detect features in the input data at multiple resolutions. The convolution layers convolve a learnable kernel with the input data to generate spatial feature maps that are passed to subsequent layers in the network. By adjusting the kernels, the resulting feature maps can be optimized to detect the location of the kidneys. Pooling layers are used to downsample the data and allow some convolution kernels to become tuned to approximate features; this also reduces the tendency of the network to over-fit the training data. When the data has been fully downsampled, upsampling layers are used to increase the resolution of the feature maps back to that of the original data while more convolution layers also learn the precise location of the kidneys. Parameters are adjusted by comparing the output from the network to a known ground truth. CNN methods have been applied to segmentation in other areas of medical imaging, [24][25][26][27] for example, to prostate segmentation of MRI images, 28 liver segmentation of x-ray CT images, 29 and segmentation of polycystic kidneys. [30][31][32] However, to date, these methods have not been successfully applied to CKD and healthy kidney segmentation from MR images.
Here, a single 2D U-Net model CNN is used for the segmentation of the kidneys in both healthy control (HC) participants and CKD patients using T2-weighted MR images. Automatically generated kidney masks are compared with manual masks defined by experts and assessed for similarity using multiple voxel- and surface-based metrics and total segmented volume. A subset of subjects was scanned multiple times to assess the repeatability of the segmentations.

| METHODS
The study was approved by the University of Nottingham Medical School Research Ethics Committee (H14082014 and E14032013), and East Midlands Research Ethics committee REC reference: 17/LO/2036 and 15/EM/0274.

| MRI data acquisition
All kidney MRI scans were acquired on a 3T Philips Ingenia system (Philips Medical Systems, Best, The Netherlands) using a 2D T2-weighted half-Fourier single-shot turbo spin echo (HASTE) sequence optimized to achieve the maximum contrast between the kidneys and surrounding tissue: TE = 60 ms, TR = 1300-1800 ms, SENSE factor = 2.5, refocus angle = 120°, bandwidth = 792 Hz, FOV = 350 × 350 mm², voxel size = 1.5 × 1.5 × 5 mm³, and a slice gap of 0.5 mm, with approximately 13 coronal slices, enough to image the entire kidney, 33,34 acquired in a single 17- to 23-s breath-hold.
The data set consisted of 60 subjects, 30 HC (10 female, 20 male) with a mean age of 26 ± 11 (19-77) years and 30 CKD patients (6 female, 24 male) with a mean age of 59 ± 14 (19-80) years and mean CKD stage 3.5 ± 1.2 (1-5). Ten of the subjects (5 HCs and 5 CKD patients) were scanned 5 times in the same scan session for use as test data. In each test data scan session, subjects were repositioned between each acquisition (removed from the scanner, asked to sit up and move on the bed). Additionally, the scanner operator attempted to vary the acquisition geometry between each scan while still acquiring full kidney coverage. These repeated test data sets allow the consistency of the network's ability to measure TKV to be assessed.
In total, 649 2D image slices from the 50 subjects in the training data and 650 2D image slices from the 10 subjects in the test data were collected. A summary of the data collected is provided in Supporting Information Table S1 and Supporting Information Figure S1.

| Manual segmentation
The manual binary masks of the kidneys of each subject were generated by 1 of 3 observers (A, B, and C, who had been trained on kidney segmentation and had an average of 2 years of experience), with each observer segmenting data from both the training and testing data sets. Kidney boundaries were manually traced using freely available software (MRIcron) and any area of non-renal parenchyma, such as the renal hilum and cysts, was excluded from the manual definition. Binary masks of the kidney were generated, and the volume of each kidney was computed from the product of the number of voxels in each kidney mask and the voxel volume. Separate kidney volumes for the left and right kidneys were determined and summed to compute TKV. All measurements were performed by observers blinded to patient number and previous TKV measurements.
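The volume calculation described above (voxel count multiplied by voxel volume, summed over both kidneys) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the nested-list mask layout are hypothetical, and whether the 0.5 mm slice gap is included in the through-plane voxel dimension is an assumption left to the caller.

```python
from typing import Sequence

# A binary mask indexed as [slice][row][column], holding 0 or 1 (hypothetical layout).
Mask3D = Sequence[Sequence[Sequence[int]]]


def mask_volume_ml(mask: Mask3D, voxel_size_mm: Sequence[float]) -> float:
    """Volume of a binary mask in millilitres (1 mL = 1000 mm^3)."""
    n_voxels = sum(v for sl in mask for row in sl for v in row)
    dx, dy, dz = voxel_size_mm
    return n_voxels * dx * dy * dz / 1000.0


def total_kidney_volume_ml(left: Mask3D, right: Mask3D,
                           voxel_size_mm: Sequence[float]) -> float:
    """TKV as the sum of the left and right kidney volumes."""
    return (mask_volume_ml(left, voxel_size_mm)
            + mask_volume_ml(right, voxel_size_mm))


if __name__ == "__main__":
    # Toy masks: one slice each, 3 and 1 kidney voxels respectively.
    left = [[[1, 1], [1, 0]]]
    right = [[[0, 1], [0, 0]]]
    # Assumed voxel size: 1.5 x 1.5 mm in-plane, 5 mm slice thickness.
    print(total_kidney_volume_ml(left, right, (1.5, 1.5, 5.0)))
```

In a real pipeline the masks would normally be held in an array library and the voxel dimensions read from the image header rather than typed in by hand.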
For the training phase, a manual mask from a single observer (randomized between observer A, B, or C) was used for each subject. For the testing phase, all 5 scans from a given subject were segmented by a single reader with the 10 subjects being segmented by a mix of the 3 readers, that is, the test data comprised subjects segmented by all readers but the repeat scans of each subject were segmented by the same reader. For 4 HC subjects from the test data set, manual masks were drawn by all 3 observers for all 5 repeat acquisitions to allow assessment of inter-observer variability in the manual masks. HCs were chosen for this analysis as healthy kidneys have a more consistent morphology and therefore will give a best-case measure of observer variability and provide a comparison of the automated method to the highest standard of manual segmentation.

| Automated segmentation using convolutional neural network architecture
Voxel intensities were normalized between 0 and 255, where 0 was set to the mean voxel intensity minus 0.5 times the SD of that slice and 255 was set to the mean voxel intensity plus 4 times the SD of the volume. This empirically derived windowing led to a clear contrast between the kidneys and surrounding tissue while negating the effects of bulk signal changes between volumes. Each data set volume was then split into 2D coronal slices and resampled to a matrix size of 256 × 256. Twenty percent of slices were reserved for validation during the network optimization process; this validation data was used to monitor over-fitting and direct the optimization process between epochs. Once the data had been split into training and validation sets, the slice order was randomized within sets. Splitting the data before slice randomization limited the possibility of slices from a single subject being split across both the training and validation data sets. During training, data augmentation was applied. At the start of each epoch, a batch of images and their corresponding masks was selected at random from the training data and a series of random shifts (up to 25% of the image in both the horizontal and vertical direction), zooms (between 0.75 and 1.25 magnification), rotations (within a 20° range), and shears (within a 5° range) were applied to the image/mask pair to produce different yet anatomically reasonable images. The weights of the network were then adjusted based on this augmented data before selecting a new batch of images for the next epoch. Augmenting the data reduces the tendency of a model to over-fit the training data and therefore increases accuracy when the model is applied to unseen images.
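The intensity windowing described above can be sketched as below. This is an illustrative sketch under stated assumptions: intensities outside the window are clipped, the population SD is used, and the mean and SD are both computed over whatever set of voxels is passed in (the text refers to both slice and volume statistics, so the exact scope is not reproduced here). The function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def window_intensities(voxels: Sequence[float],
                       lower_sd: float = 0.5,
                       upper_sd: float = 4.0) -> List[float]:
    """Map [mean - lower_sd*SD, mean + upper_sd*SD] linearly onto [0, 255]."""
    mu = mean(voxels)
    sd = pstdev(voxels)  # population SD: an assumption
    lo = mu - lower_sd * sd
    hi = mu + upper_sd * sd
    if hi == lo:
        # Constant image: no contrast to preserve.
        return [0.0 for _ in voxels]
    scaled = []
    for v in voxels:
        clipped = min(max(v, lo), hi)  # clipping is an assumption
        scaled.append(255.0 * (clipped - lo) / (hi - lo))
    return scaled


if __name__ == "__main__":
    print(window_intensities([0, 0, 0, 0, 100]))
```

The asymmetric window (0.5 SD below the mean, 4 SD above) discards most of the dark background range while keeping headroom for bright structures.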
The U-Net consists of 2 fully CNN-like structures that are cascaded in the form of an encoder-decoder (auto-encoder) structure. The encoder is used for feature extraction and the decoder is used for feature mapping to the original input resolution. A summary of the network architecture is shown in Figure 1. The convolution layers use a set of small parameterized filters, referred to as kernels, to perform convolution operations to produce different feature maps of their input. Here, each convolution and deconvolution layer uses a 3 × 3 kernel. Activation layers use a rectified linear unit (ReLU). Following convolution at each resolution, max pooling with a stride of 2 is used on the encoding half of the network.
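To make the building blocks of the encoder concrete, the toy sketch below applies a single 3 × 3 kernel followed by ReLU, then 2 × 2 max pooling with a stride of 2, to a small image. It is not the network in Figure 1: zero padding, the 2 × 2 pooling window, the single channel, and the function names are all assumptions made for illustration, and a real implementation would use a deep learning framework.

```python
from typing import List

Image = List[List[float]]


def conv3x3_relu(image: Image, kernel: Image) -> Image:
    """Zero-padded 3x3 cross-correlation followed by ReLU (output keeps the input size)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    rr, cc = r + i - 1, c + j - 1
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i][j] * image[rr][cc]
            out[r][c] = max(acc, 0.0)  # ReLU
    return out


def max_pool2(image: Image) -> Image:
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    h, w = len(image), len(image[0])
    return [[max(image[r][c], image[r][c + 1],
                 image[r + 1][c], image[r + 1][c + 1])
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]


if __name__ == "__main__":
    image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    features = conv3x3_relu(image, identity)
    print(max_pool2(features))  # spatial size halves from 4x4 to 2x2
```

Each pooling step halves the spatial size, which is why deeper encoder layers respond to coarser, more approximate features.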
The network was implemented using Keras (v2.2.4) 35 with a TensorFlow backend (v1.13.1) 36 in Python 3.6.9. All training was carried out on an NVIDIA Titan Xp graphics processing unit (GPU) (3840 CUDA cores, 12 GB GDDR5X). The network uses a Dice score loss function, with the Dice score defined as $\mathrm{Dice} = 2TP / (2TP + FP + FN)$, where TP is true-positive, FP is false-positive, and FN is false-negative; a value of 1 implies complete overlap between the automated mask and the manual mask, whereas 0 implies no overlap. This function is ideal for renal segmentation because it does not weight true negatives, which represent the majority of voxels input to the network, and therefore means that while the network is training, it does not become trapped in a local minimum outputting solely background voxels. Training was carried out over 150 epochs using stochastic gradient descent with an initial learning rate of 0.01, a learning rate decay of 5 × 10⁻⁷, and a momentum of 0.8; these parameters help the optimizer converge quickly while also avoiding overshooting. As seen in Figure 2, after 150 epochs the validation Dice score plateaued whereas the training Dice score was still rising slightly, indicating that any further training would lead to over-fitting. Training took ~30 min.
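The property that the Dice score ignores true negatives can be checked with a short sketch. The code below uses the standard definition, Dice = 2TP / (2TP + FP + FN), on flattened binary masks. The function names are hypothetical, the loss is assumed to be 1 − Dice (the exact form used in training is not stated), and a trainable version would operate on predicted probabilities rather than hard labels.

```python
from typing import Sequence


def dice_score(predicted: Sequence[int], manual: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks."""
    if len(predicted) != len(manual):
        raise ValueError("masks must contain the same number of voxels")
    tp = sum(1 for p, m in zip(predicted, manual) if p and m)
    fp = sum(1 for p, m in zip(predicted, manual) if p and not m)
    fn = sum(1 for p, m in zip(predicted, manual) if not p and m)
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2 * tp / denominator


def dice_loss(predicted: Sequence[int], manual: Sequence[int]) -> float:
    """Assumed loss form: 1 - Dice, so that perfect overlap gives zero loss."""
    return 1.0 - dice_score(predicted, manual)


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    manual = [1, 0, 1, 0]
    print(dice_score(predicted, manual))
    # Padding both masks with background voxels leaves the score unchanged.
    print(dice_score(predicted + [0] * 1000, manual + [0] * 1000))
```

Because background voxels never enter the numerator or denominator, a mask that is entirely background scores 0 whenever any kidney is present, which is what discourages the all-background solution.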
Using the model to subsequently perform segmentation of renal masks from a given T2-weighted volume of the test data took ~9 s on a standard office computer with no GPU, which is the type of machine end users would have access to.

| Statistical analysis
Baseline demographics are reported as mean ± SD. Inter-observer variability in manual segmentation and TKV was calculated by comparing the TKV of the manual masks each observer generated for a given volume, and by Bland-Altman and regression analysis. Intra-observer variability in manual segmentation was calculated by comparing the TKV of the 5 masks generated by an observer for a given subject. For each, the mean coefficient of variation (CoV; defined as SD/mean) and intraclass correlation (ICC) were used as measures of repeatability of TKV. Voxel-based (eg, Dice score) and surface-based (eg, Hausdorff distance) metrics were also calculated between each observer.
The performance of the automated segmentation was assessed using multiple voxel- and surface-based similarity metrics. Performance was further assessed by determining the mean difference in TKV between the automatic and manual methods. Both actual and percentage (%) difference in TKV were evaluated. The bias (mean difference) between the automatic and manual methods was assessed using a paired sample t test. The mean CoV and ICC were also used as measures of repeatability of the automated TKV.
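The repeatability and agreement measures named above can be sketched with the Python standard library. This is a minimal illustration with hypothetical function names and made-up numbers: the sample SD (n − 1 denominator) and the 1.96 × SD limits of agreement are assumptions, and the P value of the paired t test (which needs the t distribution) and the ICC are not computed here.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def coefficient_of_variation(values: Sequence[float]) -> float:
    """CoV = SD / mean, e.g. over the repeated TKV measurements of one subject."""
    return stdev(values) / mean(values)


def bland_altman(auto: Sequence[float],
                 manual: Sequence[float]) -> Tuple[float, float, float]:
    """Bias and 95% limits of agreement of the paired differences (auto - manual)."""
    diffs = [a - m for a, m in zip(auto, manual)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def paired_t_statistic(auto: Sequence[float], manual: Sequence[float]) -> float:
    """t statistic of a paired-sample t test on the differences."""
    diffs = [a - m for a, m in zip(auto, manual)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    repeats_ml = [290.0, 300.0, 310.0]
    print(coefficient_of_variation(repeats_ml))
    print(bland_altman([302.0, 298.0, 301.0, 299.0], [300.0] * 4))
```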

| Characteristics of the training cohort
Data were collected using a T2-weighted HASTE sequence providing optimal contrast between the kidneys and surrounding tissue, with examples shown in Figure 5. However, there is limited contrast between the left kidney and spleen because of their similar T2-weighted signal intensities. Cysts of variable size are clearly visible in the kidneys of the CKD patient. The training data comprised 25 HCs (9 female, 16 male) with a mean age of 26 ± 12 years and 25 CKD patients (6 female, 19 male) with a mean age of 58 ± 15 (19-80) years and mean CKD stage 3.3 ± 1.1 (1-5). The manual TKV was 277 ± 60 mL, ranging between 145 and 422 mL. Including both HC subjects and CKD patients meant the kidneys had variable morphology (shape, size, and heterogeneous cysts) within the training data set. Supporting Information Table S1 provides the characteristics of data sets used for training and testing of the CNN, whereas Supporting Information Figure S1 shows the distribution of TKV within the training and testing data.

| Accuracy of manual segmentation
Four of the test subjects were each scanned 5 times, with the left and right kidneys in the 20 data sets each masked by Observers A, B, and C. The intra-observer and inter-observer variability for this manual segmentation was computed, as shown in Table 1. Additionally, similarity metrics were used to assess the overlap between each observer's manual masks (Table 2). As a result of the large difference between in-plane and out-of-plane resolution (1.5 mm vs. 5.5 mm), the Hausdorff distance is very susceptible to inaccuracies in the anterior-posterior direction; this metric is also highly sensitive to noise, and as such the 95th percentile is used to generate a more representative value. Bland-Altman plots and regression analysis of inter-observer variance in measured TKV are provided in Supporting Information Figure S2.
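The effect of anisotropic voxels on the Hausdorff distance can be illustrated with the sketch below, which computes a 95th-percentile Hausdorff distance between two sets of surface voxel indices. Several definitions of this metric are in use; taking the larger of the two directed 95th percentiles, and the linear-interpolation percentile, are assumptions here rather than the definition used in the study. The function names are hypothetical and the default spacing of 1.5 × 1.5 × 5.5 mm is taken from the resolutions quoted above.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[int, int, int]  # voxel index as (column, row, slice)


def _to_mm(p: Point, spacing: Sequence[float]) -> Tuple[float, ...]:
    return tuple(i * s for i, s in zip(p, spacing))


def _directed_distances(a: Sequence[Point], b: Sequence[Point],
                        spacing: Sequence[float]) -> List[float]:
    """For every point in a, the distance in mm to the nearest point in b."""
    b_mm = [_to_mm(q, spacing) for q in b]
    return [min(math.dist(_to_mm(p, spacing), q) for q in b_mm) for p in a]


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def hausdorff_95(a: Sequence[Point], b: Sequence[Point],
                 spacing: Sequence[float] = (1.5, 1.5, 5.5)) -> float:
    """Larger of the two directed 95th-percentile surface distances, in mm."""
    return max(percentile(_directed_distances(a, b, spacing), 95),
               percentile(_directed_distances(b, a, spacing), 95))


if __name__ == "__main__":
    # A one-voxel error in-plane versus a one-slice error through-plane.
    print(hausdorff_95([(0, 0, 0)], [(1, 0, 0)]))
    print(hausdorff_95([(0, 0, 0)], [(0, 0, 1)]))
```

A disagreement of a single slice costs several times more than a disagreement of a single in-plane voxel, which is why errors in the anterior-posterior direction dominate this metric.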

| Network testing
The trained network was used to predict segmentations of the 2D kidney slices and compute TKV for each of the unseen test volumes. The mean Dice score over the 50 test volumes was 0.93 ± 0.01 (0.94 ± 0.02 for HC and 0.92 ± 0.01 for CKD patients). The TKV predicted by the network was, on average, 1.2 ± 16.2 mL less than the manually segmented TKV and therefore not significantly different (P = .615) (Figure 3). This accuracy was comparable for the HC and CKD cohorts, with automated CNN TKV measurements of 4.7 ± 17.7 mL greater than manual and 7.0 ± 12.4 mL less than manual, respectively. A summary of the CNN accuracy when evaluated using similarity metrics and volume difference from manual measures is shown in Table 3. Note a slightly larger discrepancy for the left compared to the right kidney. Figure 3 shows plots of the difference in volume between manual segmentation and automated segmentation of the test data set. All values are quoted as mean ± SD.

In Figure 4, the TKV predicted by the CNN is plotted against the manual TKV. In 90% of subjects, the SD of TKV measurements between each volume for a subject was smaller when the TKV was measured using the CNN as opposed to manually. The mean CoV and ICC were 2.7% ± 0.9% and 0.979, respectively, across the 5 repeats of the manually segmented test data (using masks from observers A, B, and C), compared to values of 1.5% ± 0.5% and 0.993, respectively, for the automatic segmentations of the 5 repeats of test data. The CNN produced a significantly lower CoV than the manual segmentations (P = .008).
Representative examples of the output from the network for both HC and CKD data are shown in Figure 5. The automated CNN accurately segments the kidneys, and for CKD patients, often omits cysts from the masks.
Because this is a 2D CNN, it is important to assess the accuracy across the anterior-posterior 2D slices of the kidney. This was achieved by comparing the Dice score of the CNN to the inter-reader Dice scores, Figure 6. A decrease in accuracy in the outer slices can be seen in both the CNN and manual masks.
This decrease in accuracy manifests itself on the outer slices of the volume, where the proportion of kidney per slice is smaller and as such the 2D network, with a lack of spatial context in the anterior-posterior direction, finds these outer slices more challenging. This decrease in accuracy can partly be explained by the fact that larger structures (in terms of number of voxels) will in general produce higher scores for comparable errors. The vast majority of errors are on the perimeter of the kidney in each slice, and slices with fewer voxels of kidney have a smaller area-to-perimeter ratio.
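The area-to-perimeter argument can be made concrete with a toy calculation. Assume, purely for illustration, that the true kidney cross-section is a square of side n pixels and that the predicted mask misses exactly the outer one-pixel rim. Then TP = (n − 2)², FP = 0 and FN = n² − (n − 2)², so Dice = 2(n − 2)² / (n² + (n − 2)²). The square shape and the function name are hypothetical.

```python
def dice_with_one_pixel_rim_error(side: int) -> float:
    """Dice between a side x side square and the same square missing its one-pixel rim."""
    if side < 3:
        raise ValueError("side must be at least 3 so that an interior remains")
    full = side * side          # voxels in the reference mask
    inner = (side - 2) ** 2     # voxels in the eroded mask, all of them true positives
    # Dice = 2TP / (2TP + FP + FN) with TP = inner, FP = 0, FN = full - inner
    return 2 * inner / (full + inner)


if __name__ == "__main__":
    for side in (10, 20, 40):
        print(side, round(dice_with_one_pixel_rim_error(side), 3))
```

The same one-pixel boundary error gives a visibly lower score for the small cross-section than for the large one, matching the trend described for the outer slices.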

| DISCUSSION
In this study, a 2D CNN has been trained to generate automatic kidney segmentations for HC subjects and CKD patients. Segmentations of the left and right kidneys are computed, from which total kidney volume is estimated. The CNN was trained on both HC and CKD kidneys with a range of TKV (144.76-422.49 mL), which included the presence of cysts. The automated segmentation by the CNN yielded a mean Dice score of 0.93 ± 0.01 and took an average time of 9 s to measure TKV compared to 15-30 min 12 for manual segmentation. The automated CNN can be run as a self-contained package with the data and program freely available (https://github.com/alexdaniel654/Renal_Segmentor). 37 Note the software released at present can only be used to process coronal HASTE images and will not be accurate with other geometries and/or contrasts. To accurately segment other geometries and/or contrasts, the network would need to be trained using a different data set; this cannot be done using the self-contained package and would necessitate the use of a GPU.

| Evaluation of methodology
The network performed with high precision on the test data, with a discrepancy of 1.2 ± 16.2 mL between manual and automated TKV measurements that was not statistically significant. Table 3 shows the agreement between the CNN and manual masks is higher for the right than the left kidney. This is in part because of the proximity and lack of contrast between the left kidney and the spleen, making distinguishing this boundary difficult for the CNN. This difficulty also leads to inconsistencies in manual masks, borne out by the increased CoV and decreased ICC and similarity metrics of the left kidney when compared to the right kidney in Tables 1 and 2, which assess the variability in manual masks between observers. From Table 3, it can also be seen that the agreement between the CNN and manual masks is greater for the HC cohort than the CKD cohort; this is expected because of the increased variation in kidney morphology and the presence of cysts in the CKD cohort. Figure 3 shows that the difference between the manual TKV and CNN predicted TKV is not dependent on the true TKV, indicating that the training data are balanced and well augmented, because the network is able to perform accurately over the full range of kidney sizes in the test data.
Here, 5 volumes of test data were collected for each subject by repositioning the subject in the scanner within a 1-hour scan session, and therefore, any variance in measured TKV is purely because of inaccuracies in the kidney ROI definition. On assessing the correlation between manual and CNN measured TKV in Figure 4, it can be seen that, in 90% of subjects, the intra-observer variance in manual TKV across the 5 volumes collected in each subject is larger than the variance in TKV estimated using the CNN, as reflected by the lower CoV and increased ICC of the TKV measured using the CNN (CoV 1.5% ± 0.5%, ICC 0.993) compared to the manual measures (CoV 2.7% ± 0.9%, ICC 0.979). Because the network is trained on the kidney segmentations from 3 observers (A, B, and C), it has been optimized by inheriting the most accurate tendencies of each observer, for example, 1 observer may have been very accurate when excluding cysts but not as accurate at defining the kidney-spleen boundary. The network will have learnt to exclude cysts from this observer but to delineate between kidney and spleen from another observer. Therefore, the network can become more precise than each individual observer's manual segmentations. This increased precision can be seen when comparing Figure 3 to Figure 4, where the variance in the difference in TKV is driven by the larger variance in manual TKV. For each subject, the volume with the smallest manual TKV is consistently overestimated by the CNN relative to its manual mask and, conversely, the volume with the largest manual TKV is often underestimated by the CNN. Figure 5 illustrates the masks produced by the manual segmentation and the CNN for both an HC subject and a CKD patient.
For the HC, the CNN includes more voxels around the edge of its mask than manual segmentation, and the network is more anatomically accurate, for example, where the interface between the kidney and spleen is very narrow, the CNN predicts the kidney is adjacent to the spleen whereas the observer's manual segmentation leaves a gap. The CKD data shown in Figure 5 includes a cyst in each of the kidneys. The network was trained on a combination of healthy and CKD data, with 19 of the 25 CKD training data sets containing at least 1 cyst. The CNN can be seen to segment out the cysts, despite their highly variable morphology and prevalence in the overall training data.
The amount of augmentation applied to the training data was empirically derived (random shifts up to 25% of the image in both the horizontal and vertical direction, zooms between 0.75 and 1.25 times magnification, rotations within a 20° range, and shears within a 5° range) and led to the potential for large transforms being applied to the data and masks if the extremes of each transform were randomly selected. This large degree of augmentation was advantageous because it mirrors the large variation in acquisition planning in abdominal imaging.
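The augmentation ranges quoted above can be sketched as a random parameter sampler. This is an illustrative sketch only: the uniform distribution, the symmetric interpretation of the rotation and shear ranges (±20° and ±5°), and all names are assumptions, and the resampling of the image itself is left to an image library. The point shown is that one set of parameters is drawn per image/mask pair so that both receive the same transform.

```python
import random
from dataclasses import dataclass


@dataclass
class AugmentationParams:
    shift_x: float       # horizontal shift as a fraction of image width
    shift_y: float       # vertical shift as a fraction of image height
    zoom: float          # magnification factor
    rotation_deg: float  # in-plane rotation in degrees
    shear_deg: float     # shear angle in degrees


def sample_augmentation(rng: random.Random) -> AugmentationParams:
    """Draw one set of transform parameters to apply to an image AND its mask."""
    return AugmentationParams(
        shift_x=rng.uniform(-0.25, 0.25),
        shift_y=rng.uniform(-0.25, 0.25),
        zoom=rng.uniform(0.75, 1.25),
        rotation_deg=rng.uniform(-20.0, 20.0),  # symmetric range is an assumption
        shear_deg=rng.uniform(-5.0, 5.0),       # symmetric range is an assumption
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is reproducible
    params = sample_augmentation(rng)
    # The same params object would be passed to the image and mask resampling steps.
    print(params)
```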
A 2D CNN was used to process each 2D slice of a full volume, rather than a 3D volume. This was advantageous for the relatively small training data set the network was optimized on, because it avoids over-fitting and allows the network to easily be used on volumes of variable slice number. However, this can come at the expense of accuracy because 2D CNNs do not leverage the information from adjacent slices in the segmentation as is done in 3D CNNs, but 3D CNNs come with a computational cost as a result of the increased number of parameters used. 3D networks have successfully been implemented on neural data using patching methods where the image volume is divided up into smaller cubes 26 to reduce memory requirements and allow for differing input shapes. Although this works well in the brain, there are a number of reasons why this method may not be as successful for body applications. The out-of-plane resolution is significantly less than the in-plane resolution; this results in far fewer slices in 1 direction than the other 2. To avoid over-fitting for a certain number of slices, for example, training on an 11 slice image with an 11³ patch, and subsequently the network not performing well when the patch is applied to a 16 slice image, the patch would need to be much smaller than the number of slices, therefore diminishing the benefits of the 3D methodology. Additionally, the extra memory requirements for a 3D network limit the ease of use of the software for inference on many standard office computers.

FIGURE 4 The TKV predicted by the CNN plotted against the manually segmented true TKV, with each subject plotted in a different color. The SD measured using both methods are shown as error bars originating from the mean of each subject. The dotted line represents perfect correlation between the CNN and manual segmentation

FIGURE 5 Representative raw test data and corresponding masks of a HC and CKD subject. Manually generated masks are shown in blue, automatically generated masks are shown in red and the overlap of the 2 is shown in magenta

| Future directions
Future work will collect more training data to compare the 2D CNN with a 3D CNN to ascertain if the potential increase in accuracy is worth the increased hardware requirements and reduced generalizability. Here, the CNN has been developed for use on a T2-weighted sequence and has not been validated on T1-weighted images. This image contrast was chosen as a result of recent publications comparing T1- and T2-weighted images for TKV assessment reporting that T2-weighted images provide better quality to enable TKV measurements, leading to improved reproducibility with lower intra- and inter-reader variability. 38 T1-weighted data could be registered to the T2-weighted data and used as an extra channel to inform the segmentation.
This network was validated on healthy subjects and CKD patients, but has not been trained and validated on subjects with ADPKD, where TKV is a recognized biomarker of disease progression. These subjects have many more cysts in their kidneys. Although the CNN was able to segment cysts encountered in the CKD cohort, it would be beneficial for future work on ADPKD to retrain the network with HC, CKD, and ADPKD data.
Another common segmentation task in renal imaging is generating an ROI for the renal cortex and medulla. There are some automated methods of achieving this once a total kidney mask has been produced; 1,39 however, there has been no work on the application of deep learning to this task. In addition to the acquisition of the T2-weighted data set used here, a T1-weighted data set designed to optimize the contrast between cortex and medulla was also collected on each subject. 34 Using these data, it may be possible to develop this method further such that an automated mask for each tissue type is produced.

| CONCLUSIONS
A CNN has been successfully applied to accurately segment the kidneys from T2-weighted renal MRI data and measure TKV in both HCs and CKD patients with higher than human precision. In the future, this will be used in clinical trials to study large numbers of CKD patients for serial measurements of TKV to monitor natural history or response to treatment.