The Impact of Fatty Infiltration on MRI Segmentation of Lower Limb Muscles in Neuromuscular Diseases: A Comparative Study of Deep Learning Approaches

Deep learning methods have been shown to be useful for segmentation of lower limb muscle MRIs of healthy subjects but have not been sufficiently evaluated on neuromuscular disease (NMD) patients.

assess the natural history of a disease or evaluate the efficacy of a therapeutic strategy.[2] These potential biomarkers could also be helpful for a better understanding of the disease mechanisms and supportive diagnosis.[3] Over the last decade, fatty infiltration has been recognized as a ubiquitous phenomenon in many neuropathies and dystrophies.[4] Fatty infiltration is a progressive replacement of muscle tissue by fat, which can be identified using quantitative MRI approaches quantifying fat fraction (FF).[5] FF has been shown to be more sensitive to disease progression than clinical or myometric measurements.[6-9] While the clinical potential of these MRI biomarkers has been widely documented, they have not yet reached routine clinical use. This limited use is related to the availability of sequences and the variability of scans but also to the fact that biomarker quantification requires a preliminary step of delineation (segmentation) of the regions of interest (i.e., individual muscles).[14-17] It has been largely recognized that manual segmentation is not an option in this specific context.[18,19] The corresponding task is time-consuming and suffers from operator dependency.[20] [22-26] However, the high inhomogeneity of fatty infiltration among individual muscles, across patients and diseases, is a major concern for those methods.[18,19] For example, in Charcot-Marie-Tooth (CMT) patients, FF can vary from 0% to 81%.[11] A semi-automated method based on contour propagation of a few manually segmented slices provided interesting results, but manual entry can still be considered prohibitive for large-scale clinical use.[27,28] [22-26] These studies have shown promising segmentation results on healthy subjects, but few have investigated severe fat infiltration. Three studies reported results in severely infiltrated patients.[21,26,29] Although each study reported a difference in outcomes between moderate and severe infiltrations, the evolution of convolutional neural network (CNN) performance as a function of FF was not investigated. In addition, the potential of segmentation methods was evaluated based on similarity between manual and automatic segmentations but not on the validity of quantitative MRI biomarker measurements. To date, only Chen et al and Ding et al evaluated the accuracy of quantitative MRI biomarkers using fully automatic segmentation. However, the evaluation was performed on a homogeneous base of lightly infiltrated subjects, as evidenced by their FF measurements.[23,24] Overall, the robustness of CNN-based segmentation methods with respect to the extent of fat infiltration remains unknown.
[31-33] The obvious choice was U-Net 2D, the reference CNN for medical image segmentation.[30] U-Net 3D, a variant of U-Net for processing volumes, was selected to compare the 2D and 3D approaches.[31] Two other networks, TransUNet and HRNet, were selected, as they address the recognized shortcomings of U-Net. TransUNet models the relations between the different elements of the image through its vision transformer module.[32] HRNet overcomes the loss of information due to the compression of the image in the U-Net encoder.[33] The objective of this study was to evaluate the performance of these four networks as a function of fatty infiltration. To enrich the training and evaluation of the CNNs, a database consisting of patients with three NMDs and controls was collected.

Standard Protocol Approvals, Registrations, and Patient Consents
The study was approved by the local research committee and was conducted in conformity with the Declaration of Helsinki (version October 2013) and the Medical Research Involving Human Subjects Act. Prior written informed consent was obtained from all subjects. Each patient provided informed consent for a retrospective analysis of the MR images recorded as part of the research protocol they volunteered for.

Subjects
Data were collected from a hospital database gathering three previous studies, on three different diseases, from the reference center for NMD and ALS at the university hospital of La Timone.[12,34] The cohort consisted of 67 patients with NMD and 14 controls (age: 53 ± 17 years, sex: 48 M, 33 F). The patients included 29 with familial amyloid polyneuropathy (FAP), 18 with CMT disease, and 20 with chronic inflammatory demyelinating polyneuropathy (CIDP). There was no recruitment or inclusion process, since the data were only collected from previous studies, which had their own recruitment/inclusion processes. A few patients were scanned several times, and the whole set of images was integrated in the training database to ensure proper training of the CNNs. The corresponding MRI dataset was composed of 218 MRI volumes (112 thighs and 106 legs).
Briefly, familial amyloid polyneuropathy (FAP) is a rare genetic disorder with autosomal-dominant inheritance due to a mutation in the transthyretin (TTR) gene, which causes a rapidly progressive polyneuropathy.[35] All subjects had a confirmed mutation in the TTR gene, with 25 symptomatic patients and 14 presymptomatic carriers. CMT disease is the most common cause of hereditary neuropathy.[36] All patients from our cohort were genetically confirmed as CMT1A patients with a classic mutation in the PMP22 gene. The third patient group had chronic inflammatory demyelinating polyneuropathy (CIDP), an acquired immune-mediated neuropathy characterized by a sensory-motor impairment. All CIDP patients fulfilled the definite clinical and electrophysiological European Federation of Neurological Societies (EFNS)/Peripheral Nerve Society (PNS) criteria for CIDP.[37] No patient had any history of other neuromuscular condition. The control group was composed of individuals with no medical history of neuropathic or muscular disease.
The segmentations were performed by neurologists who had participated in the various research protocols from which the data were gathered, and each had at least 5 years of segmentation experience (E.F., C.P.M.). The segmentations were also reviewed by another nonclinical operator with 3 years of experience in manual segmentation (M.-A.H.).
Manual segmentation was performed on T1w images. The raters only segmented a limited number of slices, depending on the patient, and a semi-automatic method using a combination of diffeomorphic registrations was used to propagate these segmentations to the remaining slices.[27,28] The final segmentations were checked by the same observers, to correct the propagated masks if needed.

Implementation
CNN architecture appears in Fig. 1. U-Net 3D is not presented since its architecture is very similar to U-Net 2D, with 3D convolutions instead of 2D ones. The network is four-layered and the number of channels is 24/48/96/192 for the encoder, and the decoder is built symmetrically. The volumes were padded with replicated slices on the extremities to create a 320 × 320 × 24 volume, making it easier to divide the number of slices by two at each layer. The characteristics of the other neural networks are described in Fig. 1.
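The slice padding described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `pad_slices` is hypothetical, the even split of the padding between both ends is assumed (the text only says the extremities were padded), and the 20-slice input is inferred from the 2D and 3D set sizes reported for training (2240 images for 112 thigh volumes).

```python
def pad_slices(volume, target_depth=24):
    """Pad a stack of 2D slices to target_depth by repeating the end slices."""
    depth = len(volume)
    if depth == 0 or depth > target_depth:
        raise ValueError("expected between 1 and target_depth slices")
    missing = target_depth - depth
    front = missing // 2          # assumption: padding is split across both ends
    back = missing - front
    return [volume[0]] * front + list(volume) + [volume[-1]] * back


# Toy volume: 20 slices, each stood in for by its index.
volume = list(range(20))
padded = pad_slices(volume)

# Slice count at each of the four levels when the depth is halved at every level.
depth_per_level = [len(padded) // 2 ** level for level in range(4)]
print(len(padded), depth_per_level)
```

With 24 slices the depth halves cleanly to 12, 6, and 3 over the four levels, whereas 20 slices would reach an odd count (5) after two halvings.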
A Python 3.8.3 environment was used to implement the CNN training with PyTorch 1.11.0. Experiments were run on a Linux workstation (Xeon Silver 4214 CPU @ 2.2 GHz, 96 GB) with an NVIDIA GeForce RTX 3090 GPU.
The T1w images were used for CNN training and consisted of 112/106 thigh/calf volumes for the 3D set and 2240/2120 thigh/calf images for the 2D set. The validation set represented 10% of the training set. Networks were individually trained using 10-fold cross validation. The loss function was the Dice loss, a standard loss function for deep learning-based image segmentation. The optimization algorithm was the PyTorch version of Adam.[38] Each network was trained with an early stopping strategy with a patience of 10 epochs. No data augmentation was performed, as the addition of random rotation and shift did not result in improved performance of the CNNs.
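To make the training criteria concrete, the sketch below shows a soft Dice loss and an early stopping rule with a patience of 10 epochs, written in plain Python on flattened lists rather than PyTorch tensors. The function names, the smoothing term `eps`, and the choice to monitor validation loss are assumptions for illustration; the source does not specify them.

```python
def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss between predicted probabilities and a binary target mask."""
    intersection = sum(p * t for p, t in zip(probs, target))
    denominator = sum(probs) + sum(target)
    # eps (an assumed smoothing term) avoids division by zero on empty masks.
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def early_stopping_epoch(val_losses, patience=10):
    """Return the epoch index at which training stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # patience never exhausted


perfect = dice_loss([1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0])
disjoint = dice_loss([1.0, 0.0], [0, 1])
# Validation loss improves for two epochs, then plateaus.
stop = early_stopping_epoch([1.0, 0.9] + [0.95] * 20)
print(perfect, disjoint, stop)
```

A perfect prediction gives a loss near 0 and a fully disjoint one a loss near 1, mirroring the DSC range used for evaluation.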

Evaluation
The performance of each network was compared to the ground truth (manual segmentation) based on geometric similarity metrics and agreement of MRI biomarkers. The geometric metric was the commonly accepted Dice similarity coefficient (DSC), an index of segmentation quality ranging from 0 (no overlap) to 1 (total overlap). In addition, for each metric, the outlier rate (OR) and the rate of unidentified muscles (AR) were computed. The OR was calculated as the rate of metric values lower than the threshold value Q1 − 1.5 × IQR, with IQR = Q3 − Q1, Q1 being the lower quartile and Q3 the upper quartile of the corresponding metric. The AR represents the ratio between the number of muscle volumes identified by operators and the number of muscles detected by the CNNs.
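As an illustration of these two definitions, the following sketch computes the DSC between two masks represented as sets of voxel indices, and the OR as the fraction of values below Q1 − 1.5 × IQR. The function names are hypothetical, and the quartile interpolation rule (`method="inclusive"`) is an assumption, since the source does not state which quartile convention was used. The AR is not sketched here.

```python
import statistics


def dice_similarity(pred_voxels, ref_voxels):
    """DSC between two masks given as collections of voxel indices."""
    pred, ref = set(pred_voxels), set(ref_voxels)
    if not pred and not ref:
        return 1.0  # two empty masks agree trivially
    return 2.0 * len(pred & ref) / (len(pred) + len(ref))


def outlier_rate(values):
    """Fraction of values below Q1 - 1.5 * IQR."""
    # Assumption: inclusive quartile interpolation; other conventions shift Q1/Q3 slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    threshold = q1 - 1.5 * (q3 - q1)
    return sum(v < threshold for v in values) / len(values)


dsc = dice_similarity({1, 2, 3, 4}, {3, 4, 5, 6})
rate = outlier_rate([0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.40])
print(dsc, rate)
```

In the toy list, only the value 0.40 falls below the threshold, so one muscle out of ten is counted as an outlier.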
From the quantitative MRI maps (FF, magnetization transfer ratio [MTR], T2), the value of each quantity was calculated as the average of the corresponding map over the entire volume. For each metric, a prediction error score ΔX was computed to represent the difference between the values measured with the masks from automatic segmentation (X_a) and from the ground truth (X_m), where X stands for FF, MTR, T2, or volume.
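The error score can be sketched as below for a flattened quantitative map. The sign convention ΔX = X_a − X_m (automatic minus manual) is an assumption, as are the function names; the toy values only illustrate how including a few high-FF voxels outside the muscle shifts the mean FF.

```python
def mean_in_mask(qmap, mask):
    """Average of a flattened quantitative map over the voxels selected by mask."""
    values = [v for v, inside in zip(qmap, mask) if inside]
    if not values:
        raise ValueError("mask selects no voxel")
    return sum(values) / len(values)


def prediction_error(qmap, auto_mask, manual_mask):
    """Delta X, assumed here to be X_a - X_m (automatic minus manual)."""
    return mean_in_mask(qmap, auto_mask) - mean_in_mask(qmap, manual_mask)


# Toy FF map (%): four muscle voxels followed by two surrounding fat voxels.
ff_map = [10.0, 10.0, 10.0, 10.0, 80.0, 80.0]
manual = [1, 1, 1, 1, 0, 0]
auto = [1, 1, 1, 1, 1, 0]   # the automatic mask leaks into one fat voxel

delta_ff = prediction_error(ff_map, auto, manual)
print(delta_ff)
```

Here the automatic mask overlaps the manual one almost entirely, yet the single leaked fat voxel raises the measured FF from 10% to 24%.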

Statistical Analysis
A benchmark analysis of the performance of the segmentation methods was conducted. This was based on two criteria: 1) average similarity between automatic and manual segmentations through DSC, OR, and AR; and 2) average measurement bias of the MRI quantities (ΔFF, ΔMTR, ΔT2, and ΔV).
Based on this comparison, the measurement bias of the best model was investigated regarding the degree of fatty infiltration, represented by the FF, using a Bland-Altman plot.
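The quantities read from a Bland-Altman plot can be sketched as follows. The multiplier 1.96 on the standard deviation of the differences is the conventional choice for 95% limits of agreement and is an assumption here, since the source does not state how its intervals were computed; the function name and toy values are illustrative only.

```python
import statistics


def bland_altman(auto_values, manual_values, multiplier=1.96):
    """Return (bias, lower limit, upper limit) of the differences auto - manual."""
    diffs = [a - m for a, m in zip(auto_values, manual_values)]
    bias = statistics.mean(diffs)
    spread = multiplier * statistics.stdev(diffs)  # assumed 1.96 * SD convention
    return bias, bias - spread, bias + spread


# Toy FF measurements (%) from automatic and manual masks on four muscles.
auto_ff = [10.5, 20.0, 29.0, 41.0]
manual_ff = [10.0, 20.0, 30.0, 40.0]

bias, lower, upper = bland_altman(auto_ff, manual_ff)
print(bias, lower, upper)
```

On the plot itself, each difference would be drawn against the mean of the two measurements, so that a trend of the error with FF becomes visible.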
The impact of fat infiltration on performance was also investigated by studying the differences in accuracy between two subgroups of muscles with FFs less (G20−) and greater (G20+) than 20%. The 20% threshold was chosen since it was approximately the maximum value on which CNN segmentation has been evaluated in former studies.[23,24] To test the difference in accuracy between G20+ and G20−, the distribution of samples was initially evaluated using the Shapiro-Wilk test. Differences were then assessed using non-parametric Wilcoxon pairwise tests or parametric Student's t-tests. The significance level was set at P < 0.05.
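A minimal sketch of the subgroup split is given below. It only partitions per-muscle errors by FF and summarizes each group; the normality and significance tests are not reproduced. Assigning a muscle with an FF of exactly 20% to G20+ is an assumption, and the names and toy values are hypothetical.

```python
import statistics


def split_by_ff(records, threshold=20.0):
    """Split (FF, error) pairs into errors for G20- and G20+."""
    g20_minus = [err for ff, err in records if ff < threshold]
    g20_plus = [err for ff, err in records if ff >= threshold]  # assumption: 20% goes to G20+
    return g20_minus, g20_plus


def summarize(errors):
    """Mean and sample standard deviation of a list of errors."""
    return statistics.mean(errors), statistics.stdev(errors)


# Toy (FF in %, biomarker error) pairs for five muscles.
records = [(5.0, 0.2), (12.0, -0.4), (35.0, 1.5), (60.0, -2.5), (20.0, 0.9)]
g20_minus, g20_plus = split_by_ff(records)
print(summarize(g20_minus), summarize(g20_plus))
```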

CNN Training
Training time ranged from 2 hours (U-Net 2D) to 3 hours (HRNet) per network. To perform 10-fold cross validation, the total time was thus between 20 hours for U-Net 2D and 30 hours for HRNet.
As can be seen in Fig. 3, DSC values were high, with the largest value obtained for Ad (0.95 ± 0.01) and the lowest for GM (0.86 ± 0.09). The smallest FF error was observed for So (0.32 ± 0.33) and the largest for GM (1.19 ± 1.16). ΔMTR reached its minimal value for So (0.28 ± 0.27) and its maximal value for Gr (1.09 ± 0.98). Regarding ΔT2 (msec), So was the muscle with the smallest error (−0.81 ± 0.90), whereas the largest error was observed for Gr (−2.95 ± 2.39). The smallest volume error ΔV (cm³) was found for Gr (3.37 ± 3.14) and the largest for So (15.66 ± 14.11).
The Pearson correlation coefficient between DSC and each biomarker metric did not exceed −0.45.
Regarding identification error, HRNet reached an AR score of 0%, whereas this score ranged from 1.39% to 4.04% for the other networks (Table 1). The OR scores varied slightly between networks and were between 4.53% and 5.77% for the thigh and between 6.39% and 7.52% for the leg. The outlier rate was about 2% higher for the leg muscles than for the thigh muscles. Figure 4 illustrates both error types: missing muscle segmentations, which the AR represents (Fig. 4b), and poor muscle contour detection, which the OR represents (Fig. 4c,d).

Robustness to Fat Infiltration
The sensitivity of measurement errors to fat infiltration is illustrated by the Bland-Altman plot shown in Fig. 5. Considering that the whole set of networks performed similarly well and for the sake of clarity, Bland-Altman plots are only presented for HRNet, the CNN for which the whole set of muscles was identified.
As indicated by the confidence intervals, the measurement errors ranged between [−2.12, 1.49] (%) for ΔFF, [−1.32, 1.7] for ΔMTR, and [−4.36, 3.26] (msec) for ΔT2. Volume error was less reliable, with a wider confidence interval of [−7.58, 6.81] (cm³). The mean error and standard deviation for each metric were consistently larger for the G20+ group than for the G20− group (Table 2). In addition, the statistical analysis revealed that the increase in error was significant for each metric (P < 1.00 × 10⁻³).

Discussion
In this study, the performance of CNNs for automatic segmentation of individual muscles was evaluated. 2D U-Net, 3D U-Net, TransUNet, and HRNet were selected from the state of the art to perform this task. The models were tested on a large database heterogeneous in degree of fat infiltration, composed of patients with three different NMDs and a population of controls. A comparison of the results was made according to the similarity of the predicted segmentations with the manual references (DSC, AR, and OR) as well as the agreement between the MRI quantities measured with the segmentations of the CNNs and the references (FF, MTR, T2, volume). The results of the best model, HRNet, were studied against the degree of infiltration, that is, FF. Finally, the muscle population was divided into two infiltration subgroups (G20−, G20+) to perform a statistical comparison of HRNet performance on each group.
The accuracy of the automatically predicted segmentations, illustrated by DSC values, was slightly better than the values reported in the literature using CNNs (from 0.77 to 0.93)[21,23,24,26] and similar to the values reported using non-learning semi-automatic methods (0.90 ± 0.03).[27] It is noteworthy that for the latter method, manual segmentation of a few slices was needed as a preliminary step, and this prerequisite might be seen as a limitation for large clinical applications.[18] Our method was consistent, as the results obtained on our database, which was heterogeneous in fatty infiltration (ranging from 1.87% to 64.34%), were comparable with the scores reported in the literature on a slightly infiltrated set of subjects (<20%).[24,32] Moreover, our results were superior to those obtained on a set of severely infiltrated subjects in the studies of Rohm et al (0.85 ± 0.08), Agosti et al (0.87), and Gadermayr et al (0.88).[21,26,29] Although of interest, DSC cannot be considered as a standalone index for assessing the performance of a segmentation algorithm in a clinical context. While DSC informs about the overlap between a predicted and a manually segmented mask, one must also quantify clinically useful indices such as FF, MTR, and T2. Rather counter-intuitively, DSC values in our evaluation were poorly correlated with biomarker quantification errors. In other words, a high DSC value would not necessarily indicate a reduced error regarding other metrics of interest. For example, a segmentation mask with a high DSC could properly cover a muscle region; however, if it also covers some other voxels around the muscle, that is, intermuscular fat, then the FF could be highly biased. Similarly, if the DSC is low, but the segmentation is in the central part of a muscle with a homogeneous infiltration, quantification of biomarkers would not be biased, and the corresponding results would be close to those from the ground truth. Thus, from a clinical outcome perspective, the most relevant performance indicator in segmentation studies may be biomarker quantification.
In a clinical context and more specifically in the field of neuromuscular disorders, CNNs are expected to provide a high-quality segmentation of individual muscles, which could be used to compute MRI biomarkers with a very high accuracy. This is of utmost importance if one intends to use MRI biomarkers to follow the natural history of muscle diseases or to assess the efficacy of a therapeutic strategy in a short time window. The corresponding errors quantified in the present study were relatively low: −0.3% ± 1.0% for FF, 0.2 ± 0.8 for MTR, and −0.55 ± 1.94 msec for T2. Of interest, the errors were lower than the changes reported over a 12-month period in CMT1A (FF: 1.1% ± 2.4%, T2: 1.4 ± 2.6 msec, MTR: 1.1 ± 2.4).[7] This result indicates that CNN-based segmentation could be used to compute MRI biomarkers of interest and characterize subtle changes in longitudinal follow-up studies. This requires, however, that infiltrated images be correctly segmented as well.
The accuracy of quantitative MRI biomarker quantification based on fully automatic segmentation methods has been poorly evaluated in the literature. Only two studies have reported FF quantification in individual muscles using U-Net-based segmentation, and the corresponding assessment was performed in a limited number of patients, that is, 24 and 4, respectively.[23,24] While Ding et al reported a 0.17% systematic bias, Chen et al reported a confidence interval of [−0.56%, 0.49%] for the thigh and [−0.71%, 0.84%] for the leg muscles.[23,24] FF errors computed in our entire database were slightly larger than theirs, with a systematic bias of −0.28% and a CI of [−2.12%, 1.49%]. However, it should be kept in mind that the infiltration range in our database (<64%) was much larger than those in the quoted studies (<20%), and this is likely to have a detrimental effect on the quality of segmentations. When only muscles with 0%-20% infiltration were selected, the biomarker estimation results (systematic bias: −0.20%, CI: [−1.38%, 0.976%]) were closer to the values reported by Ding et al and Chen et al.[24,32] Comparative analysis between systematic errors obtained in muscles with FF values below (G20−) and above 20% (G20+) conducted in our database indicated that the segmentation and accuracy of the corresponding biomarker were negatively affected by high FF values. This effect was particularly noticeable on volume, where the error was increased 10 times on severely infiltrated muscles. This distorting effect of FF could be related to the fact that fat infiltration could affect the visibility of muscle boundaries.
The volumes of muscles segmented by CNN were lower on average than the corresponding volumes calculated from manual segmentations.In other words, the volume of CNN segmentations was significantly underestimated on the most infiltrated muscles.
Detailed inspection of the individual MR images in our cohort suggests that the FF value is not the only factor responsible for the quantification bias.The pattern of fatty infiltration also appears to play a role.Although this aspect warrants further study, sparse infiltration would not prevent the detection of muscle contours and would not bias the quantification of muscle volume.On the contrary, a more compact infiltration would have a more detrimental effect.
In terms of geometric efficiency, the different CNNs performed almost equally well, with minimal differences regarding DSC values. All the CNNs, apart from HRNet, failed to identify some muscles. Non-identification occurs when the CNN assigns the missing muscles to the image background, which suggests a possible avenue for improvement. Since the identification of individual muscles is essential for the correct quantification of biomarkers, this study identifies HRNet as the most appropriate network for the segmentation of muscle images.

Limitations
This study was performed with data from a single center and a single scanner. It would be relevant to compare the performance of the CNNs on another database, from another center, and on other neuromuscular diseases (e.g., facioscapulohumeral muscular dystrophy). As the accuracy of deep learning methods is highly dependent on the nature of the training data, a transfer learning approach might be required to achieve the same results.[39] Many neural networks could have been used in this study. Among the great variety of CNNs, we have chosen U-Net, the standard for medical image segmentation, as well as variants using a transformer module and 3D processing. To diversify our approach, we have chosen to test a different architecture from the encoder-decoder scheme with HRNet. We believe that the choice of these four networks was sufficient to evaluate the effect of fat infiltration on the automatic segmentation. Testing other architectures may be done in the future, but would be beyond the scope of this study.
The reliability of our approach was assessed from a comparative analysis with manual segmentation. Although the manual segmentation strategy was carefully detailed in guidelines, previous comparative analyses have indicated that DSC values computed for segmentations performed by different observers ranged from 0.80 to 0.95.[27] We considered that this variability regarding manual segmentations was of interest given that it provided an additional source of heterogeneity which might be learned by the CNNs. However, this necessarily implies that without absolute ground truth, the goal of a fully automated segmentation method should be to achieve an accuracy that matches the inter-operator variability.

Conclusion
All four networks tested in this study provided high-quality segmentations on FAP, CMT, and CIDP patients. This indicates that they could be used for accurate quantification of biomarkers. Although we identified a biasing effect of fat infiltration on biomarker accuracy, it was still acceptable compared with the 12-month patient biomarker trends, demonstrating the potential of follow-up studies.

FIGURE 1: (a) Network architectures, including number of channels and number of convolutional layers, for U-Net 2D, TransUNet, and HRNet. D: dimension of the square image, C: number of channels. (b) Processing pipeline composed of 1) training CNNs on the T1w image database; 2) predicting segmentations with one of the four trained CNNs; 3) applying predicted segmentations to each qMRI map; and 4) extracting scores for each biomarker to compare them with those from manual segmentations.