Automated and accurate segmentation of leaf venation networks via deep learning

Leaf vein network geometry can predict levels of resource transport, defence, and mechanical support that operate at different spatial scales. However, it is challenging to quantify network architecture across scales, due to the difficulties both in segmenting networks from images, and in extracting multi-scale statistics from subsequent network graph representations. Here we develop deep learning algorithms using convolutional neural networks (CNNs) to automatically segment leaf vein networks. Thirty-eight CNNs were trained on subsets of manually-defined ground-truth regions from >700 leaves representing 50 southeast Asian plant families. Ensembles of 6 independently trained CNNs were used to segment networks from larger leaf regions (~100 mm2). Segmented networks were analysed using hierarchical loop decomposition to extract a range of statistics describing scale transitions in vein and areole geometry. The CNN approach gave a precision-recall harmonic mean of 94.5% ± 6%, outperforming other current network extraction methods, and accurately described the widths, angles, and connectivity of veins. Multi-scale statistics then enabled identification of previously-undescribed variation in network architecture across species. We provide a LeafVeinCNN software package to enable multi-scale quantification of leaf vein networks, facilitating comparison across species and exploration of the functional significance of different leaf vein architectures.

Introduction

Plant leaves are structured by a vein network that ranges in geometry from dendritic fans with few branches and no loops (e.g. Ginkgo biloba) to hierarchical forms with many loops (e.g. Acer saccharum) (Trivett & Pigg, 1996; Roth-Nebelsick et al., 2001). The vein network has multiple roles that include transport of water, nutrients and sugars through the xylem and phloem tissues (Brodribb et al., 2010; Carvalho et al., 2018; Katifori, 2018), controlled deformation during bud burst (Niklas, 1999), and mechanical support and resistance to damage in the mature leaf (Sack & Scoffoni, 2013; Sharon & Sahaf, 2018). Measurements of vein architecture have provided insights into leaf development (Kang & Dengler, 2004) and reconstruction of paleo-environments from leaf fossils (Manze, 1967; Uhl & Mosbrugger, 1999; Blonder et al.).

Images were captured using a microscope (Olympus BX43) with a 2× apochromat objective and an Olympus SC100 color camera (3840 × 2748 pixel resolution). Nine to sixteen overlapping image fields were stitched together to obtain a complete image of the sampled area using the image capture software. Final images had a resolution of 595 pixels mm⁻¹ and a typical extent of ~7000 × 7000 pixels.

Images were pre-processed by retaining the green channel (which had the greatest contrast) and applying contrast-limited adaptive histogram equalization (CLAHE; Zuiderveld, 1994) with a 400 × 400 pixel window and a contrast limit of 0.01.

Ground-truth tracing

All veins within a polygonal region-of-interest (ROI) of approximately 700 × 700 pixels were manually traced at full width for each image, using a digitizing tablet (Cintiq 22HD, Wacom) and image processing software (GIMP). In addition, any veins in the complete image with width >200 µm were manually traced, as these veins were not routinely included in the GT-ROIs used to train the CNNs. Inclusion of manually traced large veins avoided over-segmentation of texture within the larger veins, which were occasionally recognized by the CNN as separate veins. Alternatively, we used down-sampled images with the same CNN to automatically segment larger veins at full width.
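For illustration, the image pre-processing step described above (green-channel extraction followed by CLAHE) can be sketched as follows using scikit-image; this is a minimal sketch rather than the exact implementation used here, and assumes the stitched image is available as an RGB NumPy array:

```python
import numpy as np
from skimage import exposure, img_as_float

def preprocess(rgb):
    """Retain the green channel and apply CLAHE, as described in the Methods."""
    green = img_as_float(rgb[..., 1])           # green channel has the greatest vein contrast
    # 400 x 400 pixel window and a contrast (clip) limit of 0.01
    return exposure.equalize_adapthist(green, kernel_size=400, clip_limit=0.01)
```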
'Deep' neural networks include multiple convolutional layers, whilst 'learning' occurs via a backpropagation algorithm (Werbos, 1974) that optimises the weights of the network by comparing the output result to the GT.

We used a CNN based on the U-Net architecture (Ronneberger et al., 2015). The first convolutional layer comprised 32 small (3 × 3) kernels that were convolved with the image to give 32 feature maps (Fig. 1). The feature maps were batch-normalized to give a mean of zero and variance of 1 to speed up training (Ioffe & Szegedy, 2015), followed by non-linear rectification (ReLU) that sets negative inputs to zero to improve training performance (Glorot et al., 2011). A second convolution using another set of 32 small (3 × 3) kernels was used to increase the receptive field in this layer (Simonyan & Zisserman, 2014), followed by batch-normalization and ReLU (Fig. 1). The size and resolution of the input feature maps and the output feature maps were preserved when passing through the convolutional blocks, with padding applied during convolutions. To generate the next layer, the output from each convolution block was subject to non-linear down-sampling using the maximum from a 2 × 2 pixel window with a stride of 2, effectively halving the resolution of the feature maps in both dimensions. This gave five levels of feature maps in the down-sampling arm of the network, which provided increasingly higher-level representations of the input image patches.
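As an illustrative Keras sketch of the convolutional block described above (two 3 × 3 convolutions, each followed by batch normalization and ReLU, then 2 × 2 max pooling with stride 2); layer choices beyond those stated are assumptions rather than the exact implementation:

```python
from tensorflow.keras import layers

def conv_block(x, n_filters):
    """Two 3x3 convolutions, each followed by batch normalization and ReLU."""
    for _ in range(2):
        x = layers.Conv2D(n_filters, 3, padding='same')(x)   # padding preserves feature-map size
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    return x

def down_step(x, n_filters):
    """Convolutional block followed by 2x2 max pooling with stride 2 (halves resolution)."""
    skip = conv_block(x, n_filters)
    return layers.MaxPooling2D(pool_size=2, strides=2)(skip), skip
```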

We also followed the approach of U-Net and Fully Convolutional Networks (FCN) (Long et al., 2015), and kept the same number of feature maps for layers with the same resolution, whilst doubling the number of feature maps when the resolution was halved by down-sampling, and halving the number of feature maps when the resolution was doubled in the up-sampling arm. The feature maps at each down-sampled layer were also copied and concatenated with those at the same resolution in the up-sampling arm, to give feed-forward shortcut connections that retain the high-resolution information required to yield a per-pixel output (Long et al., 2015). The activation function of the final fully connected output layer was sigmoid and gave a 1:1 pixel classification of the vein probability. This resulted in nine convolutional blocks in total (Fig. 1).
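Assembling these blocks into the full encoder-decoder gives a network of the following form. This Keras sketch reuses the hypothetical conv_block and down_step helpers from the previous example; the use of UpSampling2D (rather than a transposed convolution) in the up-sampling arm and the parameter names are assumptions, not the published implementation:

```python
from tensorflow.keras import layers, Model

def build_unet(input_shape=(256, 256, 1), base_filters=32, depth=5):
    """U-Net-style network: filters double on down-sampling and halve on up-sampling,
    with skip connections concatenated at matching resolutions and a sigmoid output."""
    inputs = layers.Input(shape=input_shape)
    x, skips = inputs, []
    for level in range(depth - 1):                        # down-sampling arm (5 levels of feature maps)
        x, skip = down_step(x, base_filters * 2 ** level)
        skips.append(skip)
    x = conv_block(x, base_filters * 2 ** (depth - 1))    # bottleneck block
    for level in reversed(range(depth - 1)):              # up-sampling arm
        x = layers.UpSampling2D(size=2)(x)
        x = layers.Concatenate()([skips[level], x])       # feed-forward shortcut connection
        x = conv_block(x, base_filters * 2 ** level)
    outputs = layers.Conv2D(1, 1, activation='sigmoid')(x)   # per-pixel vein probability
    return Model(inputs, outputs)                             # nine convolutional blocks for depth=5
```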

Data augmentation and training data sets
The manually digitized GT images were used to train and evaluate the CNN. Each GT region was resampled to augment the training data and to prevent over-fitting. Thirty-two samples per image, fully contained within the GT region, were extracted with randomized variation in the box size (256 × 256 ± 20% pixels), rotation angle (between 0 and 2π radians), and (x,y) shift in centroid coordinates, and with randomized image reflection. Augmented samples were then resized to 256 × 256 pixels (Fig. 2).
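A minimal sketch of this augmentation scheme, assuming NumPy and scikit-image; the check that restricts patches to lie fully within the GT polygon is omitted, and sampling details beyond those stated above are assumptions:

```python
import numpy as np
from skimage.transform import rotate, resize

def augment_patch(image, gt, rng, out_size=256):
    """Extract one randomly sized, rotated, shifted and possibly reflected patch,
    then resize it to out_size x out_size pixels. Assumes the GT region is larger
    than the sampled box (containment check omitted)."""
    box = int(out_size * rng.uniform(0.8, 1.2))           # box size 256 +/- 20%
    angle = rng.uniform(0, 360)                           # rotation between 0 and 2*pi radians
    img_r = rotate(image, angle, preserve_range=True)
    gt_r = rotate(gt.astype(float), angle, order=0, preserve_range=True)
    y = rng.integers(0, img_r.shape[0] - box)             # randomized (x, y) shift
    x = rng.integers(0, img_r.shape[1] - box)
    img_c, gt_c = img_r[y:y + box, x:x + box], gt_r[y:y + box, x:x + box]
    if rng.random() < 0.5:                                # randomized reflection
        img_c, gt_c = img_c[:, ::-1], gt_c[:, ::-1]
    return (resize(img_c, (out_size, out_size), preserve_range=True),
            resize(gt_c, (out_size, out_size), order=0, preserve_range=True) > 0.5)
```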

Network training
The CNN was initialized using the Glorot uniform initializer, with random weights drawn from a uniform distribution (Glorot & Bengio, 2010). At the start of each training run, 120 non-overlapping images were reserved as the test data set. For each training epoch, 480 leaf images were randomly selected from the remaining images and the manually digitized ROIs used as training samples. The remaining 120 images were used to validate the network performance. The trained CNN was then used to segment the set of 120 full-size test images that had not been used in training or validation. This required seven iterations to segment all of the original full-size images. The process was repeated to give an ensemble of six independently trained networks for each image (Hansen & Salamon, 1990; Sollich & Krogh, 1996). A flow diagram of the process is shown in Fig. 2. Of the 727 samples in the dataset, seven had digitized regions that were too small for image patch extraction; these were excluded from training, but were still used for validation. At each training iteration, a mini-batch of eight 256 × 256 patches, constrained by the memory limit of each GPU, was sampled and fed to the network in sequence to ensure robust convergence by avoiding local minima. The pixel-wise cross entropy was used as the loss function during training, and the Dice similarity metric was used for network selection.

The U-Net algorithm was run on one of two clusters, either with 2× Intel E5-2609v2 processors (total 8 cores at 2.5 GHz, 128 GB RAM), or 2× Intel E5-2609v3 processors (total 12 cores at 1.9 GHz, 128 GB RAM). Image processing occurred in parallel on three graphical processing units (GPUs) (nVidia GeForce GTX 1080 Ti, each with 3580 cores and 11 GB memory). All CNN code was implemented in Keras with TensorFlow using Python 3. Training the CNN on the calibration data required 8 min per epoch using 3 GPUs and 1 CPU.
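For completeness, the loss and model-selection metric described above could be wired up in Keras roughly as follows; the optimizer choice and the Dice implementation details are assumptions, and build_unet refers to the hypothetical sketch given earlier:

```python
import tensorflow as tf

def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Dice similarity between the ground truth and the predicted probability map."""
    y_true_f = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)

model = build_unet()
model.compile(optimizer='adam',                 # optimizer choice is an assumption
              loss='binary_crossentropy',       # pixel-wise cross entropy
              metrics=[dice_coefficient])       # Dice similarity used for network selection
# model.fit(patches, labels, batch_size=8, epochs=...)   # mini-batches of eight 256 x 256 patches
```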

The trained networks were also imported into MATLAB (Mathworks, Natick, MA) for incorporation into a GUI interface (see user manual and software provided in the Supplementary Information), with up-sampling implemented using transposedConv2dLayer layers with a stride of 2 and uniform weights equal to 1.

Vein prediction on the full-size images
For vein prediction over the full 10 mm × 10 mm region, images were processed as overlapping 256 × 256 pixel patches. At each iteration, a mini-batch of eight patches was sampled at random without replacement, with a separation (stride) of 50 pixels in both x and y to give >25-fold over-sampling and a small overlap between patches. The results were assembled back to the original image size to give a probability map of predicted values, with each pixel classified at least 25 times within different image contexts due to the stride overlap. Calculation of the CNN on a full-sized image took ~8 minutes on the GPU clusters. The analysis code running on a single CPU took ~20 min, depending on the complexity of the network and the amount of down-sampling. The typical processing speed was approximately 60,000 px s⁻¹.
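A sketch of this tiled prediction scheme, assuming a trained Keras model and omitting the mini-batching, border handling and masking details of the actual pipeline:

```python
import numpy as np

def predict_full_image(model, image, patch=256, stride=50):
    """Average per-pixel predictions over overlapping patches to build a probability map."""
    h, w = image.shape
    prob = np.zeros((h, w), dtype=np.float32)
    count = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            tile = image[y:y + patch, x:x + patch][np.newaxis, ..., np.newaxis]
            pred = model.predict(tile, verbose=0)[0, ..., 0]
            prob[y:y + patch, x:x + patch] += pred
            count[y:y + patch, x:x + patch] += 1.0
    return prob / np.maximum(count, 1.0)   # each interior pixel is classified ~25 times
```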

Conversion of the CNN probability map to a weighted network
For each full-size leaf image (Fig. 3a), the six CNN probability maps were averaged to generate a mean probability map (Fig. 3b), down-sampled by a factor of 2 to reduce subsequent computational costs, and binarised using a threshold based on the average statistic from Precision-Recall analysis (see below). Only the largest connected component was retained. Damaged and background regions that were defined manually for each image were masked from the analysis.
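This conversion step might be sketched as follows with scikit-image; the function and variable names are illustrative only:

```python
import numpy as np
from skimage.measure import label
from skimage.transform import downscale_local_mean

def probability_to_fwb(prob_maps, threshold):
    """Average the ensemble probability maps, down-sample by 2, threshold,
    and keep only the largest connected component."""
    mean_prob = downscale_local_mean(np.mean(prob_maps, axis=0), (2, 2))
    binary = mean_prob > threshold
    labels = label(binary)
    if labels.max() == 0:
        return binary
    largest = np.argmax(np.bincount(labels.ravel())[1:]) + 1   # skip the background label 0
    return labels == largest
```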

The resultant full-width binary (FWB) image was thinned to a single-pixel-wide skeleton (Fig. S1b). Simple thinning algorithms gave an irregular zig-zag backbone around junctions. A better centerline was realized using a hybrid skeletonisation protocol that combined watershed segmentation to find all closed loops (Fig. 3c, magenta) with thinning to extract tree-like branches (Fig. 3c, green). In brief, watershed segmentation was applied to the Euclidean Distance Map (EDM) of the FWB image to find all closed loops (Fig. 3c, magenta). Tree-like branches were extracted as the difference between the complete thinned skeleton (Zhang & Suen, 1984) and the skeleton after removal of spurs (Fig. 3c, green). The 'trunks' of the tree-like skeleton were then re-connected to the nearest pixel in the watershed loop skeleton, determined from the nearest-neighbor transform of the EDM, and the pixel positions written back into the skeleton. This approach had the benefit that loops and tree-like portions of the skeleton were automatically separated and could be analysed separately (Fig. 3c).
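A simplified sketch of the hybrid skeletonisation idea using SciPy/scikit-image; the spur-removal and trunk re-connection steps of the original protocol are only approximated here, so this is not the exact algorithm used:

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed
from skimage.morphology import skeletonize

def loop_and_tree_skeletons(fwb):
    """Split the centreline skeleton into closed loops (watershed lines on the EDM)
    and tree-like branches (thinned skeleton lying outside the loop system)."""
    edm = ndi.distance_transform_edt(fwb)          # distance map; ridge lies along the vein centreline
    markers, _ = ndi.label(~fwb)                   # one marker per areole / background region
    basins = watershed(edm, markers, watershed_line=True)
    loop_skel = (basins == 0) & fwb                # watershed lines trace the closed loops
    full_skel = skeletonize(fwb)                   # complete thinned skeleton
    tree_skel = full_skel & ~ndi.binary_dilation(loop_skel)   # remaining spurs / tree-like branches
    return loop_skel, tree_skel
```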

The pixel skeleton was converted into separate vein segments using the Matlab edgelink algorithm (Kovesi, 2000), which provided a set of pixel-coordinates for each vein in the network. The vein segments were converted to a graph representation, and a set of metrics was calculated for each vein segment (Table S2), node (Fig. S1e, Table S3), areole, and polygonal area (Fig. S1f, Tables S4 and S5, respectively). In addition, a set of summary statistics describing the distribution of each metric was also calculated (Table S7), adopting a comparable terminology to Larese et al. (2014), with prefixes V - veins, N - nodes, A - areoles, and P - polygonal areas. Vein metrics were further sub-divided into the total vein network (Tot), veins forming loops (Loop), and veins forming trees (Tree). The summary statistics were drawn from the mean (av), median (md), minimum (mn), maximum (mx), standard deviation (sd), or skewness (sk) of each metric distribution. In the case of angular metrics, values were summarized using circular statistics (Berens, 2009). Each summary statistic was then given a three-letter code.

The revised 'center-weighted' width value (three-letter code Wid), along with the corresponding length (Len; excluding vein overlap and measured as the sum of the Euclidean distances between edge pixels), surface area (SAr), volume (Vol), tortuosity (Tor), and orientation (Ori), were added to the vector of edge properties, along with the average edge intensity (Int) and average CNN probability (Prb) (Fig. S1d). A vector of properties was also calculated for each node that included the node degree (K), the diameter of the parent branch (BD0) and daughter branches (BD1 and BD2), their orientations (OD0, OD1, and OD2), and the branching angles (A10, A20, and A21). The angular metrics were calculated for straight-line segments joining the node and the midpoint of each vein (Fig. S1e).
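To illustrate how per-edge properties of this kind could be assembled into a weighted graph, a hedged NetworkX sketch is shown below; the segment data structure, the simplified width and surface-area calculations, and all names are assumptions rather than the published implementation:

```python
import numpy as np
import networkx as nx

def build_vein_graph(segments):
    """Assemble vein segments (lists of (row, col) pixel coordinates plus a width profile)
    into a weighted graph whose edges carry simplified versions of the metrics above."""
    G = nx.MultiGraph()
    for pixels, widths in segments:
        start, end = tuple(pixels[0]), tuple(pixels[-1])
        steps = np.diff(np.asarray(pixels, dtype=float), axis=0)
        length = float(np.sum(np.hypot(steps[:, 0], steps[:, 1])))   # Len: summed Euclidean distance
        width = float(np.mean(widths))                               # Wid: width (simplified, not centre-weighted)
        chord = np.asarray(end, dtype=float) - np.asarray(start, dtype=float)
        G.add_edge(start, end,
                   Wid=width,
                   Len=length,
                   SAr=width * length,                               # surface area (rough approximation)
                   Tor=length / max(np.hypot(chord[0], chord[1]), 1e-9),   # path length / chord length
                   Ori=float(np.arctan2(chord[0], chord[1])))        # orientation of the chord
    return G
```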

Dual-graph representation and hierarchical loop decomposition (HLD)
A dual-graph of the vein skeleton was calculated that joined adjacent polygons, with nodes placed at the centroid of the polygons and the connecting edges weighted using the width of the intervening vein segment (Fig. 3f). The dual-graph was converted to a nested branching tree using hierarchical loop decomposition (HLD; Katifori & Magnasco, 2012) by progressively removing the thinnest edges in sequence and fusing the adjacent polygons until only a single polygon remained (Fig. 3g,h). This process determined how vein loops are nested within larger vein loops across spatial scales. A subset of vein and polygonal area metrics was calculated at each step, allowing analysis of network metrics across the scale of the fusion events (Table S6). Analysis was restricted to fully bounded regions (Fig. 3g).

Precision-Recall analysis

Segmentation performance was evaluated against the GT using Precision-Recall (P-R) analysis, summarized by the F1 score (the harmonic mean of precision and recall) and the more general Fβ score, Fβ = (1 + β²)·precision·recall / (β²·precision + recall), in this case using β = 2, which weights recall higher than precision. The threshold that gave the highest F1 or Fβ2 score was used to segment the image.
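A minimal NetworkX sketch of the loop-decomposition step (a simplification of the published HLD algorithm); the 'width' edge attribute name and the handling of parallel edges created by fusion are assumptions:

```python
import networkx as nx

def hierarchical_loop_decomposition(dual):
    """Progressively remove the thinnest dual-graph edge and fuse the two adjacent
    polygons until a single polygon remains, recording each fusion event."""
    G = dual.copy()                     # nodes: polygons; edge attribute 'width': intervening vein width
    fusions = []
    while G.number_of_edges() > 0:
        u, v, data = min(G.edges(data=True), key=lambda e: e[2]['width'])
        fusions.append((u, v, data['width']))                  # spatial scale of this fusion event
        G = nx.contracted_nodes(G, u, v, self_loops=False)     # fuse the two adjacent polygons
    return fusions          # sequence of fusion events, from thinnest to thickest bounding vein
```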

Comparison with other vein segmentation methods
To compare performance with other methods, the GT-ROI was analysed using a range of standard local adaptive thresholding methods, including Midgrey, Niblack (Niblack, 1985), Sauvola (Sauvola & Pietikäinen, 2000), and Bernsen (Bernsen, 1986). In addition, Hessian-based 'vesselness' enhancement (Frangi et al., 1998) was used with vein thicknesses between 4 and 9 pixels, applied over four scales using a Gaussian image pyramid (Burt & Adelson, 1983) to cover the largest veins. We also tested phase-congruency enhancement (Kovesi, 1999; Kovesi, 2000), which we have previously used to segment fungal (Obara et al., 2012), slime mold (Fricker et al., 2017) and ER networks (Pain et al., 2019). We used the normalized local weighted mean phase angle ('Feature Type') to give intensity-independent enhancement of vein structures, initially calculated over 3-5 scales and 6 orientations, and then applied to an image pyramid to cover larger scales. In both the vesselness and phase-congruency enhancement, noise was suppressed by setting all variation less than 0.1-0.2 in an extended minimum transformation to zero. As an alternative morphology-based approach, we also tested the multiscale Bowler Hat algorithm (Sazak et al., 2018), which is designed to be more robust at junctions and less sensitive to interference from blob-like inclusions. In all cases, the resultant multiscale representation was collapsed to a single image using a maximum projection and segmented over varying thresholds from 0 to 1 in 0.05 increments to find the optimum performance, as judged by the maximum F1 or Fβ2 statistic in Precision-Recall comparison to the GT. A full set of processing parameters for these methods is given in Table S1.
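The threshold sweep and Fβ scoring used throughout can be sketched as follows; this assumes a binary ground truth and an enhanced image scaled to the range 0-1:

```python
import numpy as np

def f_beta(precision, recall, beta=1.0):
    """General F-score; beta = 2 weights recall more heavily than precision."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0

def best_threshold(enhanced, gt, thresholds=np.arange(0.0, 1.0001, 0.05), beta=2.0):
    """Sweep thresholds over an enhanced image and return the one maximizing F-beta against GT."""
    best = (0.0, 0.0)
    for t in thresholds:
        pred = enhanced >= t
        tp = np.sum(pred & gt)
        precision = tp / max(np.sum(pred), 1)
        recall = tp / max(np.sum(gt), 1)
        score = f_beta(precision, recall, beta)
        if score > best[1]:
            best = (t, score)
    return best   # (optimal threshold, F-beta score)
```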

As the vein enhancement approaches may be better suited to extraction of the pixel skeleton, rather than a binary image covering the full width of the vein, the P-R analysis was also run following conversion of the binary image at each threshold value to a single-pixel-wide skeleton.

Data and algorithm availability
All MATLAB scripts and the standalone LeafVeinCNN software and manual (Fig. S2) are provided in the Supplementary Information.

Results

Examples of segmented networks are shown in Fig. 4. In each case, the centre-line of each vein was correctly identified and the topology of the complete network was accurately segmented.

To quantify ensemble CNN performance, we examined a masked ROI from each full-size image (Fig. 5a) containing the manually delineated GT (Fig. 5a'). The ensemble CNN applied to this ROI provided a smooth, high-contrast probability map of vein identity (Fig. 5b), which was thresholded to give a FWB image (Fig. 5b') and compared to the GT using P-R analysis (Fig. 5l). The value used to threshold the CNN probability map was systematically varied to generate a P-R curve, and the optimum value was determined from the maximum F1 or Fβ2 metric (see Fig. 5l, circled and asterisked points, respectively). The P-R image constructed using the Fβ2 threshold value (Fig. 5b') showed that the majority of the segmentation matched the GT (green), whilst the very tips of the FEVs were slightly clipped (shown as false negatives, FN, in red) and the vein widths were slightly over-estimated, with some false positives (FP, shown in blue).

Ensemble CNN performance was affected by the contrast range present in the original image (Fig. S3), but this could be completely recovered using CLAHE to enhance contrast (Fig. S4). Furthermore, performance was stable within a wide range of CLAHE settings (Fig. S5), suggesting that the most critical feature is to cover the full contrast range, rather than fine tuning of the local adaptive features. Increasing levels of blur degraded CNN performance (Fig. S6), but only at levels that would be unacceptably out-of-focus in the original image. Possibly the strongest effect on CNN performance was the pixel resolution of the original image, where fine veins were progressively lost as the resolution decreased (Fig. S7). Nevertheless, performance was almost completely recovered by simple image interpolation back to the standard pixel size (Fig. S8).

CNNs performed better than other network extraction algorithms

P-R curves for the other enhancement methods were worse than the CNN result (Fig. 5l). All the local adaptive thresholding approaches tested (Midgrey, Niblack, Bernsen and Sauvola; Fig. 5c-f, respectively) captured much of the main vein structure, but missed or fragmented fine veins (Fig. 5c'-f', red) and included some noise or non-vein structures (Fig. 5c'-f', blue). The Hessian-based techniques ('Vesselness', MFATλ and MFATp; Fig. 5g-i, respectively) showed slightly greater selectivity for veins over background compared with the adaptive thresholding methods, but had errors in the width estimate (Fig. 5g'-i'). The 'Vesselness' enhancement retained more of the intensity information from the original image and tended to give low intensities at junctions (Fig. 5g,g'), whilst MFATλ and MFATp gave higher contrast and better resolution of junctions (Fig. 5h-i, h'-i').
The phase-congruency enhancement selected veins irrespective of their size and captured the topology quite well (Fig. 5j), but gave numerous small errors along the boundary of the binarised veins and included discrete round structures that we infer to be glands (Fig. 5j'). The Bowler Hat enhancement (Fig. 5k) improved the selection of veins over non-veins, but was still dominated by the original intensity information, making subsequent threshold selection difficult (Fig. 5k').

As several of these enhancement methods were originally designed to facilitate extraction of the skeleton, rather than the FWB image, we also compared their performance against the GT skeleton (Fig. 5a''-k''). The performance of the CNN was even better in this comparison (F1 = 0.98, Fβ2 = 0.99) and was robust to threshold selection (Fig. 5m), with an optimal threshold across all species of 0.534 ± 0.202 s.d. for F1 and 0.378 ± 0.197 s.d. for Fβ2. The other methods lagged behind the CNN result, but typically showed better P-R performance for the skeleton compared with the full width. However, they also showed much greater sensitivity to threshold selection, with a very rapid fall-off in performance moving away from the optimal threshold (Fig. 5m). Without a GT for each image, such automatic optimization of the threshold value would not be possible for any of the algorithms, so methods that are highly sensitive to the threshold value are likely to perform much worse when used for typical data sets that lack GTs.

When the above procedure was applied to all leaves in the dataset, the Fβ2 statistic values for the CNN were higher than for all other algorithms: 0.945 vs 0.837, t = -36.463, df = 1062.7, p < 10⁻¹⁶ (Fig. 6a). Thus, CNN enhancement gave accurate and robust segmentation of these leaf vein networks and out-performed the currently available network enhancement and segmentation methods tested. Performance was also consistent across plant families and, by inference, across diverse network architectures (Fig. 6b).

As the best P-R performance based on pixel classification may not equate to the optimal network extraction, we also examined the impact of varying thresholds on a set of basic network metrics, including Area eccentricity (Fig. 7a), Area mean (Fig. 7b), FEV ratio (Fig. 7c), No. junctions (Fig. 7d), Length density (Fig. 7e), and Node density (Fig. 7f). For all metrics, the CNN approach showed the lowest fractional error from the GT skeleton, followed by the Bowler Hat, phase-congruency and Hessian-based techniques. The adaptive threshold methods performed very poorly, with fractional errors 1-2 logₑ units adrift.
A range of metrics was calculated following CNN network extraction from the full-size images for the veins, areoles and polygonal regions (Tables S2-S5). Each metric had a distribution characteristic of the species. For example, the relationship between vein width and length (Fig. 8a), or between loop circularity and elongation (Fig. 8b), showed markedly different patterns between the six species illustrated. In some species (A. odoratissimus and D. lanceolata), the vein width showed little separation into distinct vein orders, whilst for the others, discrete vein classes were readily identifiable (arrows in Fig. 8a), but with markedly different numbers in each class.

Venation comparisons using HLD and multi-scale metrics
Given the limitations of single-scale metrics, we explored analysis using HLD to progressively group regions of the leaf into clusters at different scales and so generate multi-scale statistics for each metric (Fig. 3g,h). For example, the spatial scaling of vein density (VTotLD) changed as a function of the fusion vein width (Wid) (Fig. 9a). At small scales, some species had a much higher vein density (e.g. P. laxiflora); however, this reduced rapidly once a specific size class of veins was removed (50 µm in this case; Fig. 8a).

The level of redundancy was also reflected in the amount of network branching across spatial scales, measured by the minimum spanning tree (MST) ratio (Fig. 9c).

The space-filling geometry of the network also changed across spatial scales (Fig. 9d). Thus, the median areole circularity decreased with vein size class in all species, but at different rates. Areole circularity was similar between species at small spatial scales but diverged at larger spatial scales: D. lanceolata had the least circular areoles, due to fusing regions becoming more elongated with a more parallel vein architecture, while A. odoratissimus maintained the most circular areoles.

CNNs improve network segmentation accuracy and robustness
Ensemble CNNs provide an alternative approach for leaf vein network segmentation with high accuracy, as judged both by P-R analysis against GT tracings using pixel classification and by a set of network measures. Ensemble CNNs also provided consistently better segmentation of leaf vein networks from field-collected leaves, across many different species and leaf vein architectures, than other currently available network extraction algorithms. In addition, the CNN probability map produced smooth, full-width segmentation of veins, even in the presence of image artifacts such as differential clearing, air bubbles, or trichomes. This in turn led to improved robustness, precision and accuracy in graph extraction, vein width estimation, and subsequent quantitative metrics.

We note that this deep learning approach contrasts with previous neural network approaches for leaf architecture (Grinblat et al., 2016), which used simpler network architectures and more limited training data, with the goal of developing a classification system rather than extracting the venation network itself.

Ensemble CNNs allowed rapid characterization of minor vein networks at multiple spatial scales with minimal manual intervention, removing a major bottleneck of manual tracing. For comparison, hand-tracing the equivalent area predicted by the CNNs would have required >36,000 person-hours to complete. Nevertheless, the results presented here are truncated by the leaf sample extents, which prevented exploration of scales larger than the ~10 × 10 mm² sampled areas.

The trained ensemble CNNs used input images with pixel resolution (1.68 µm pix⁻¹) and contrast adjusted to cover the full dynamic range following CLAHE, and can be used without modification to analyse other leaf images of comparable contrast and resolution. Other leaf vein images at lower spatial resolution can still be segmented well using up-sampling with interpolation, to ensure that the smallest veins are ~5 pixels wide. A typical example is shown in Fig. S9 for a leaf originally collected at 6.7 µm pix⁻¹ using the LeafVeinCNN GUI (Fig. S2).

Indeed, we have recently made such an analysis of these network-function linkages using networks extracted from this dataset (Blonder, 2020), and we propose that multi-scale statistics may enable a more quantitative and richer description of network architecture that supplements the utility of qualitative classifications.
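A minimal sketch of the up-sampling step discussed above, assuming scikit-image; the constant and function names are illustrative only:

```python
from skimage.transform import rescale

TRAINED_PIXEL_SIZE = 1.68   # um per pixel used to train the ensemble CNNs

def match_training_resolution(image, pixel_size_um):
    """Up-sample a lower-resolution leaf image to the resolution the CNNs were trained on,
    so that the smallest veins are ~5 pixels wide."""
    factor = pixel_size_um / TRAINED_PIXEL_SIZE     # e.g. 6.7 / 1.68 ~ 4x up-sampling
    return rescale(image, factor, order=1, preserve_range=True, anti_aliasing=False)
```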

Acknowledgments
We thank David Ford for assistance with the GPU cluster. This work was supported by the