Body composition assessment by artificial intelligence from routine computed tomography scans in colorectal cancer: Introducing BodySegAI

Body composition is of clinical importance in colorectal cancer patients, but is rarely assessed because of time‐consuming manual segmentation. We developed and tested BodySegAI, a deep learning‐based software for automated body composition quantification from routinely acquired computed tomography (CT) scans.


Introduction
Computed tomography (CT) examinations are crucial for the diagnosis and follow-up of patients with colorectal cancer (CRC). In addition, routine CT scans contain valuable and high-precision information about body composition, such as skeletal muscle (SM), visceral adipose tissue (VAT), subcutaneous adipose tissue (SAT), and intermuscular and intramuscular adipose tissue (IMAT). Because low SM and high IMAT are predictors of reduced survival in CRC patients, 1,2 body composition data that may be collected from standard CT scans should be used in clinical practice. High amounts of VAT may pose a larger risk for CRC patients than total fat mass, 3 and VAT is also negatively associated with survival in CRC patients. 4 IMAT itself is associated with insulin resistance and loss of muscle strength and function. 5,6 Although body composition data from clinically acquired CT scans may be used to tailor nutritional interventions 7 and optimize cancer treatment, 8,9 it is currently not part of standard cancer care.
The lack of utilization of CT-based body composition data may be due to the absence of accurate and automated tools. Even though CT is described as one of the gold standards for the measurement of body composition, 10 segmentation has typically been performed manually or semi-manually. Both manual and semi-manual processes are time consuming and require extensive resources, anatomical knowledge, and software training. This limits the use of CT for body composition purposes in clinical practice and large-scale clinical trials.
The ongoing rapid development of deep learning has revolutionized the field of automated image segmentation, including quantitative CT-based body composition assessment. 11 In particular, convolutional neural networks (CNNs), directly inspired by brain neurons, are well suited to process large amounts of imaging data. 12 Several studies have shown promising results using deep learning-based methods for quantification of abdominal SM or adipose tissues. 13-26 As CNNs require large amounts of data, it is common to use multiple available data sets for training purposes. However, no software has been trained and tested on CT slices from different anatomical levels acquired from CRC patients. Previous studies typically rely on single slices at L3. 13,16,17,19-21,26 This may limit the usefulness of these models, as clinically acquired CT scans differ according to the protocol used or the region of interest. Therefore, the main aim of this study was to develop and test the performance of an automatic software named BodySegAI for the quantification of abdominal SM, IMAT, VAT, and SAT in CT scans from CRC patients. To overcome the limitation of using a single slice at L3, we aimed to train the model using multiple CT slices, from both the abdominal and the pelvic area (L2 to S1). Additionally, we compared the performance of BodySegAI against a similar software named AutoMATiCA. 17 Furthermore, we also aimed to investigate the time effectiveness of BodySegAI compared with semi-manual segmentation by human readers.

Network set-up and training
We used a two-dimensional U-Net 27 adapted in a similar fashion as by Weston et al. 14 with six down-sampling and up-sampling steps. Each down-sampling step had two convolutions (3 × 3) with a hyperbolic-tangent (tanh) activation function, followed by max pooling (2 × 2). Each up-sampling step had interpolation (2 × 2) and two convolutions (3 × 3) with tanh activation. In each step, the convolution output from the down-sampling side was concatenated with the corresponding up-sampling step. The last layer was a convolution (1 × 1 × 3) followed by a sigmoid function. The first and last steps had 16 kernels, and the bottleneck layer of the U-Net had 16 × 2⁶ = 1024 kernels, resulting in an 8 × 8 × 1024 representation. The model had 31 109 747 trainable weights. The highest score for each pixel above a threshold of 0.5 determined the final segmentation. The input was one channel of size 512 × 512, and the output was three channels of the same size. VAT and SAT were classified in separate output channels. The third output channel represented a combined class of IMAT and SM, which was then split into separate IMAT and SM masks by thresholding according to the Alberta protocol (described in the 'Image segmentation' section). All pixels with a corresponding raw image value of −30 HU or less were determined to be IMAT. Any segmented pixels outside the predefined threshold values for the four segmentation labels were discarded. Hyper-parameters and pre-processing and post-processing steps were selected experimentally using 300 validation CT slices from the training data set. The model was later trained using all training and validation data. Input images were clipped to −400 to 600 HU and normalized by subtraction of the mean and division by the standard deviation of the training data set (−536 and 493, respectively).
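The pre-processing and the SM/IMAT split described above can be sketched in a few lines. This is a minimal illustration in numpy, assuming the normalization is applied after clipping and using the reported training-set statistics; the function names are ours, not from the published software:

```python
import numpy as np

# Clip range and training-set statistics reported in the text.
CLIP_MIN, CLIP_MAX = -400.0, 600.0
TRAIN_MEAN, TRAIN_STD = -536.0, 493.0

def preprocess_slice(hu: np.ndarray) -> np.ndarray:
    """Clip a CT slice in Hounsfield units and z-score normalize it."""
    clipped = np.clip(hu.astype(np.float32), CLIP_MIN, CLIP_MAX)
    return (clipped - TRAIN_MEAN) / TRAIN_STD

def split_sm_imat(combined_mask: np.ndarray, hu: np.ndarray):
    """Split the combined SM+IMAT output channel by HU thresholding.

    Pixels within the mask at -30 HU or below become IMAT; the rest become SM.
    """
    imat = combined_mask & (hu <= -30)
    sm = combined_mask & (hu > -30)
    return sm, imat
```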
We used the Dice loss function, a batch size of 16, and Adam optimization with a learning rate decaying from 0.0001 to 0.000001. The model was trained for 250 epochs. During training, simple linear augmentations were applied: translation by up to 50 pixels in any direction, up to 5% rotation, 10% scaling, and 40% probability of right/left flipping. The programming environment was Python v. 3.
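The Dice loss named above can be written framework-independently. A minimal numpy sketch of a soft Dice loss for a single output channel, with a small smoothing constant (an assumption on our part, not specified in the text) to avoid division by zero on empty masks:

```python
import numpy as np

def soft_dice_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """1 - Dice overlap between a predicted probability map and a binary target.

    pred: sigmoid outputs in [0, 1]; target: ground-truth mask of 0s and 1s.
    eps is an assumed smoothing constant, not taken from the paper.
    """
    intersection = float(np.sum(pred * target))
    denom = float(np.sum(pred) + np.sum(target))
    dice = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - dice
```

A perfect prediction yields a loss near 0, a fully wrong one a loss near 1, which is what makes the Dice loss suitable for class-imbalanced segmentation targets such as IMAT.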

Training data set
A flow chart of the CT data used for model training, validation, and testing is shown in Figure 1. BodySegAI was trained on a total of 2989 CT slices from 286 subjects recruited from three independent studies (Table 1). The bulk of the training data set consisted of pre-operative and post-operative scans of CRC patients in the randomized controlled trial CRC-NORDIET. 28 From 153 CRC patients, we acquired mid-vertebral single CT slices from L2 to S1, in addition to volumes (20% of the distance from the iliac crest towards the lowest point of the mandible, corresponding to the abdominal volume defined for dual-energy X-ray absorptiometry (DXA) by the International Society for Clinical Densitometry 29 ). From two previous studies, we also acquired single slices from 94 healthy 30 and 39 diabetic subjects. 31 A portion of the training data set was used to validate the model (number of slices = 300) and was later reinserted into the training data set for training purposes (Figure 1). All CT examinations were performed between 1991 and 2020. Thus, BodySegAI was trained on multi-centre CT slices from various scanners and protocols with different contrast phases, slice thicknesses, tube currents, and voltages.

Test data set and human ground truth
An independent test data set consisting of 154 post-operative CT slices from 32 CRC patients from the CRC-NORDIET study 28 was created to test the performance of BodySegAI (Table 2). The median age and BMI in the test data set were 66 years and 27.6 kg/m², and 53% of the population were men (Table 2). The slice thickness was 3 mm for all the test data. The test data set was designated for testing exclusively and had never been introduced to BodySegAI before testing. The performance of BodySegAI was tested against a human ground truth and compared with AutoMATiCA, a CNN-based open-source software for automatic segmentation at level L3. AutoMATiCA was trained on L3 slices from renal and liver donors and patients who were critically ill or had liver cirrhosis, pancreatic cancer, or clear cell renal carcinoma. 17 The human ground truth was created as a reference for the test data set using 'Simultaneous truth and performance level estimation' (STAPLE), an expectation-maximization algorithm that provides a probabilistic estimate of the true segmentation. 32 The STAPLE-based human ground truth was established by combining the segmentations performed by three experienced human readers, A-MF (radiographer), PML (radiologist), and DHA (registered dietitian), hereafter referred to as Readers 1, 2, and 3, respectively. Each of the three human readers segmented the 154 CT slices included in the test data set in sliceOmatic.
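STAPLE treats the true segmentation as a hidden variable and alternates between estimating each reader's sensitivity and specificity and re-estimating the per-pixel truth. A simplified, single-label binary sketch in numpy (flat pixel vectors, fixed prior, fixed iteration count; this illustrates the principle and is not the implementation used in the study):

```python
import numpy as np

def staple_binary(decisions: np.ndarray, prior: float = 0.5, n_iter: int = 50) -> np.ndarray:
    """Simplified STAPLE for binary masks.

    decisions: (n_readers, n_pixels) array of 0/1 reader segmentations.
    Returns the per-pixel probability that the true label is 1.
    """
    d = decisions.astype(float)
    w = d.mean(axis=0)  # initialize the hidden truth with the mean vote
    for _ in range(n_iter):
        # M-step: per-reader sensitivity p and specificity q given w
        p = (d @ w + 1e-9) / (w.sum() + 1e-9)
        q = ((1 - d) @ (1 - w) + 1e-9) / ((1 - w).sum() + 1e-9)
        # E-step: posterior probability of the true label at each pixel
        a = prior * np.prod(np.where(d == 1, p[:, None], 1 - p[:, None]), axis=0)
        b = (1 - prior) * np.prod(np.where(d == 0, q[:, None], 1 - q[:, None]), axis=0)
        w = a / (a + b + 1e-12)
    return w
```

Thresholding the returned probabilities (e.g. at 0.5) yields a consensus mask that weights readers by their estimated reliability rather than by a simple majority vote.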

Image segmentation
SM, IMAT, VAT, and SAT were segmented using sliceOmatic Version 5.0 Revision 7 (TomoVision, Montreal, Canada). All segmentation was performed in accordance with the Alberta protocol as defined and used in body composition segmentation studies performed at the Alberta hospital. 33 Segmentation of the training data set was performed by four radiographers, two registered dietitians, one oncologist, and one radiologist. For the thresholding-based segmentation tool, the following values were used: SM: −29 to 150 Hounsfield units (HU), IMAT: −190 to −30 HU, VAT: −150 to −50 HU, and SAT: −190 to −30 HU. In case of beam-hardening artefacts caused by metal implants or colonography contrast media, the thresholding was turned off to enable correct segmentation. We defined the lowest vertebra above the sacrum as L5.
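The HU windows listed above can be collected in a small lookup for building threshold masks. A minimal sketch in numpy, assuming inclusive range boundaries (the protocol's boundary handling is not stated here); the dictionary layout and function name are illustrative:

```python
import numpy as np

# Alberta-protocol HU thresholds as listed in the text (inclusive ranges assumed).
HU_RANGES = {
    "SM":   (-29, 150),
    "IMAT": (-190, -30),
    "VAT":  (-150, -50),
    "SAT":  (-190, -30),
}

def threshold_mask(hu: np.ndarray, tissue: str) -> np.ndarray:
    """Binary mask of pixels whose HU values fall within the tissue's window.

    Note: thresholding alone cannot separate IMAT from SAT (identical windows);
    anatomical region assignment is still needed, as in the manual workflow.
    """
    lo, hi = HU_RANGES[tissue]
    return (hu >= lo) & (hu <= hi)
```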

Statistical methods
Statistical analyses were performed in SPSS (IBM SPSS Statistics 27). The performance of BodySegAI was tested against the human ground truth. Dice score coefficients, Hausdorff distances, sensitivity, and specificity were calculated. The Dice score quantifies the overlap between two segmentations and is defined as (2 × true positive) / (2 × true positive + false positive + false negative) pixels. The Hausdorff distance is the maximum of the distances in mm between points in two point sets, calculated by the following formula: Hausdorff distance(A,B) = max(h(A,B), h(B,A)). 34 Main results are given as medians with 25th and 75th percentiles, as data were non-normally distributed, unless otherwise specified. The median absolute error (%) was calculated as the median of the absolute difference in percent between BodySegAI and human ground truth, divided by human ground truth. Bland-Altman plots were used to quantify the mean differences per slice between segmentation methods. 35 The y-axis is defined as the difference between methods (BodySegAI − human ground truth), and the x-axis represents the mean values ((BodySegAI + human ground truth)/2). Upper and lower limits of agreement were defined as the mean difference ± 1.96 × standard deviation (SD) (corresponding to 95% limits of agreement). BodySegAI was additionally compared against AutoMATiCA. Performance was investigated both for single slices alone and for all slices combined. Time effectiveness was presented as the average time (in seconds) with SD for segmentation of randomly selected CT scans by Readers 1, 2, and 3 against BodySegAI. The difference was tested using the independent samples t-test. A computer with an Intel® Core™ i5-8500 processor, Intel UHD Graphics 630, 16 GB of RAM, and the Windows 10 operating system was used.
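Both evaluation metrics can be computed directly from binary masks and point sets. A numpy-only sketch; the brute-force Hausdorff computation below is illustrative and not necessarily the implementation used in the study:

```python
import numpy as np

def dice_score(a: np.ndarray, b: np.ndarray) -> float:
    """2*TP / (2*TP + FP + FN) for two binary masks of the same shape."""
    tp = np.logical_and(a, b).sum()
    return 2.0 * tp / (a.sum() + b.sum())

def hausdorff_distance(pts_a: np.ndarray, pts_b: np.ndarray) -> float:
    """max(h(A,B), h(B,A)), where h(A,B) = max over a of min over b of ||a-b||.

    pts_a, pts_b: (n, 2) arrays of point coordinates (e.g. mask pixel
    coordinates scaled to mm by the pixel spacing).
    """
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```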

BodySegAI compared with human ground truth
The median Dice score for BodySegAI compared with the human ground truth was 0.969 for SM, 0.814 for IMAT, 0.986 for VAT, and 0.990 for SAT. Dice scores, Hausdorff distances, sensitivity, and specificity of BodySegAI compared with human ground truth using the test data set are shown in Table 3. Bland-Altman plots comparing the segmented volume between BodySegAI and human ground truth per slice are shown in Figure 2. The mean differences between BodySegAI and human ground truth were −0.09 (limits of agreement: −2.37 to 2.19) for SM, −0.17 (−1.69 to 1.36) for IMAT, −0.12 (−1.34 to 1.09) for VAT, and 0.67 cm³ (−1.03 to 2.36) for SAT. All Bland-Altman plots showed an even distribution of the differences above and below the mean difference.
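The Bland-Altman quantities reported here reduce to a few lines of arithmetic. A sketch in numpy, assuming the SD is the sample standard deviation of the per-slice differences (consistent with the standard Bland-Altman method):

```python
import numpy as np

def bland_altman(method_a: np.ndarray, method_b: np.ndarray):
    """Return (mean difference, lower LoA, upper LoA) for paired measurements."""
    diff = method_a - method_b  # e.g. BodySegAI minus human ground truth
    bias = diff.mean()
    sd = diff.std(ddof=1)       # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```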
The performance of BodySegAI against human ground truth for different anatomical slice locations is shown in Table 4. Although L2, L3, and L4 demonstrated similar Dice scores, L3 presented the overall highest median Dice, and S1 demonstrated the lowest.
The Dice score of each human reader vs. the STAPLE-based human ground truth for L2 to S1 is presented in Table 5. Figure 3 shows sample cases of the performance and typical errors in segmentation by BodySegAI.

Figure 3
Sample cases demonstrating the segmentation performance of BodySegAI, the human ground truth, and the unsegmented CT slice. The Dice score, the difference between BodySegAI and human ground truth, and the total segmented body compartments are shown. The CT slices with green arrows show (A) particularly good performance by BodySegAI, (B) a mistake in the segmentation of musculus transversus abdominis by BodySegAI, (C) an error in the segmentation of a patient with an ostomy by BodySegAI, and (D) an error in the segmentation of VAT by BodySegAI. Skeletal muscle is segmented in red, intermuscular and intramuscular adipose tissue in green, visceral adipose tissue in yellow, and subcutaneous adipose tissue in blue. The mean difference represents BodySegAI minus human ground truth, and total is the segmented volume by human ground truth.

BodySegAI compared with AutoMATiCA
When testing our test data set against human ground truth, AutoMATiCA showed lower Dice scores than BodySegAI for all body compartments in L2 to S1 (Table 3). AutoMATiCA also showed higher Hausdorff distances than BodySegAI for SM, VAT, and SAT. Median Dice scores by AutoMATiCA for the L3 slice alone were 0.962 for SM, 0.843 for IMAT, 0.986 for VAT, and 0.986 for SAT (slices tested = 31).

Time effectiveness of BodySegAI
The average time used for automatic segmentation of one slice by BodySegAI was 4.9 s (SD: 0.7, number of slices = 20). Semi-manual segmentation by Readers 1, 2, and 3 required on average 723.5 s (SD: 232.3, number of slices = 99); BodySegAI thus segmented on average 148 times faster (P < 0.001) than the human readers.
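The comparison above uses the independent samples t-test. A minimal pooled-variance sketch in numpy returning only the t statistic and degrees of freedom (obtaining a p-value additionally requires the t-distribution CDF, e.g. from scipy.stats, which is omitted here to stay self-contained):

```python
import numpy as np

def independent_t(x: np.ndarray, y: np.ndarray):
    """Pooled-variance independent samples t statistic and degrees of freedom."""
    nx, ny = len(x), len(y)
    # Pooled variance across the two groups
    sp2 = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    t = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2
```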

Discussion
The deep learning-based segmentation tool BodySegAI showed excellent segmentation performance for SM, VAT, and SAT (Dice scores: ≥0.969) and good results for IMAT (Dice score: 0.814) in multiple abdominal CT slices in CRC patients. These results are in line with other deep learning models tested on cancer patients using only a single slice at L3 or L4. 14,16,20,23,24 In the present study, BodySegAI demonstrated higher Dice scores than AutoMATiCA when using our test data set. However, in the original article by Paris et al., AutoMATiCA achieved higher Dice scores for SM and IMAT, but not for VAT and SAT, than BodySegAI obtained in the present study. 17 This suggests that the CT test data used in our study had different characteristics from the data used to train AutoMATiCA. Lack of generalization is a well-known challenge in deep learning. It may be improved by training the model on heterogeneous data spanning the full range of image characteristics in clinical use. 11 BodySegAI was trained on data from several institutions using different CT scanners and protocols, which in itself introduces variation. Additionally, the CT slices in the training data set originated from three independent groups of subjects with heterogeneous characteristics (i.e. CRC, healthy, and diabetic). Such heterogeneity in imaging data and study population may facilitate a more robust model and better generalizability.

Multi-slice modelling
A single slice at L3 is frequently used in research settings because of its high association with whole-body measurements 36,37 and abdominal volume. 38 We have previously shown that although a single slice at L3 is highly associated with abdominal VAT volume, the use of multiple slices (L2, L3, and L4) increased the explained variance against VAT volume. 38 In contrast to most previous models, BodySegAI was trained and tested on multiple slices at different anatomical levels (L2 to S1). This may have increased the model's robustness to anatomical changes after abdominal surgery, and it enables segmentation of data from other anatomical levels in cases where the L3 slice is unavailable or contains artefacts. A few models have targeted a multi-slice approach, and BodySegAI demonstrates similar or higher Dice scores than these. 13,18,22,25,39

Limitations
Although our model was trained on multi-centre data, it was tested on a sample of post-operative CT scans from CRC patients. BodySegAI should be tested on other patient groups and settings to assess its external validity. By visual inspection of the segmented CT slices, we observed that BodySegAI under-performs on CT slices with abnormalities such as oedema, ostomy, or artefacts. The training set consisted mostly of lumbar slices, which may explain the lower performance at the S1 level. This is important to keep in mind if the model is to be used at a sacral level. BodySegAI was trained on various slice thicknesses (3-5 mm) but was only tested on CT slices of 3 mm thickness. Thus, the effect of slice thickness on segmentation performance is unknown. Due to the comprehensive task of semi-manual segmentation for establishing the human ground truth and the lack of standardized regions of interest, BodySegAI was only tested on single CT slices from predefined anatomical regions (L2 to S1), not on entire abdominal volumes.
However, the model can readily be used for volumetric data, which has been proposed to further increase accuracy compared with using single slices only. 38 Although we consider BodySegAI an automatic software, the axial CT slices still have to be manually extracted from the CT examination and uploaded into BodySegAI before the body composition analysis can be conducted.

Strengths
To our knowledge, this is the first software developed to automatically segment SM, IMAT, SAT, and VAT in multiple CT slices from L2 to S1 in CRC patients staged TNM I-III. Due to the time-consuming task of semi-manual segmentation by humans, most previous studies have used only one or two readers to establish the ground truth or reference method. 17 Although the main variation in segmentation results is due to subject variation and not inter-rater segmentation variability, 40 a single reader's subjective performance may influence the reference standard or ground truth. By using STAPLE, we reduced this vulnerability by providing a ground truth representing a statistically optimized combination of segmentations from several human readers. 32 Our results also showed that each of the three readers contributed almost equally to the final STAPLE-based human ground truth (Table 5).

Clinical implications
There is increasing evidence of the importance of body composition for health outcomes in cancer patients; nevertheless, there is a lack of tools for clinical implementation. Other methods for the assessment of body composition include DXA and bioelectrical impedance analysis. However, these methods are often unavailable in hospitals, are less precise, and require additional resources compared with using the existing CT scans. 10 In clinical practice, weight loss in kilograms or BMI is therefore often used to identify conditions such as malnutrition, sarcopenia, or cachexia, despite not being able to distinguish between fat and muscle.
In comparison with the existing and routinely used methods to estimate body composition or weight, we consider the general uncertainty of BodySegAI to be low. This is demonstrated by the narrow limits of agreement against human ground truth in the Bland-Altman plots. The median absolute errors, which were below 2% for total SM, VAT, and SAT and 10.5% for IMAT, also support the high accuracy of BodySegAI. Consequently, we consider the performance of BodySegAI to be acceptable for research purposes and to exceed the methods for body composition evaluation used in clinical practice today.
Many patients undergo CT scans as part of routine care, and we suggest integrating BodySegAI into the clinical workflow. Body composition data could be used to track changes and risk-stratify patients, enabling the diagnosis and treatment of malnutrition, sarcopenia, and cachexia. It may also enable large-scale studies to utilize CT-based body composition data more extensively and thereby generate larger sets of normative data to use as reference curves in the clinic.

Conclusion
BodySegAI generates excellent and detailed segmentations of SM, VAT, and SAT and good segmentations of IMAT, 148 times faster than human readers. After validation on local data, BodySegAI may replace the semi-manual segmentation presently used in research. We suggest that BodySegAI may be integrated into a future clinical workflow for patients who undergo CT as standard care.