Lumbar spine segmentation method based on deep learning

Abstract Aiming at the difficulties of lumbar vertebrae segmentation in computed tomography (CT) images, we propose an automatic lumbar vertebrae segmentation method based on deep learning. The method consists of two parts: lumbar vertebra localization and lumbar vertebrae segmentation. First, we propose a UNet-based lumbar spine localization network, which directly locates the lumbar spine in the image. Then, we propose a three-dimensional XUnet lumbar vertebrae segmentation method to achieve automatic lumbar vertebrae segmentation. The proposed method was validated on lumbar spine CT images from the public dataset VerSe 2020 and our hospital dataset. Through qualitative comparison and quantitative analysis, the experimental results show that the proposed method achieves good lumbar vertebrae segmentation performance and can be further applied to the detection of spinal anomalies and surgical treatment.


INTRODUCTION
Over the past few decades, various medical imaging modalities, including x-rays, computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET), have been widely used to examine spinal structural abnormalities. 1,2 CT imaging has been used to intuitively observe the internal structures of the human body and quantify the volume of various tissues, owing to its fast scanning speed and clear image generation. 3 Accurate lumbar vertebrae segmentation is an important step in subsequent analysis and treatment, including examining abnormal vertebral fractures, biomechanical modeling, lumbar inter-body fusion, and anterior and posterior lumbar inter-body fusion. 4 In particular, these analyses and treatments place high demands on the accuracy of lumbar vertebrae segmentation. Before the maturity of artificial intelligence technology, the segmentation of vertebral bodies was mostly done manually by multiple experienced radiologists. However, manual vertebrae delineation is labor-intensive and time-consuming, and the delineation standards of different radiologists are inconsistent, resulting in large differences in the segmentation results. Therefore, an automatic and efficient vertebrae segmentation method is of great significance for subsequent morphological analysis and clinical treatment.
The essence of automatic segmentation of vertebral bodies from CT images is to delineate the vertebrae in CT images to provide a reliable basis for clinical diagnosis and treatment research; it can be understood as a pixel-level classification task. Traditional image segmentation methods, such as threshold-based and region-based segmentation, can be applied to medical image segmentation. However, due to the influence of imaging equipment, imaging principles, and other factors, the content and shape of medical images are relatively complex, and these traditional methods still face great challenges in segmentation accuracy.
Recently, significant progress has been made in image segmentation based on deep learning, mainly with convolutional neural networks. There are many segmentation architectures with good performance, such as fully convolutional networks (FCN), 6 SegNet, 7 U-Net, 8 and 3D UNet. 9 The networks used in many image segmentation studies are based on the above networks or their corresponding 3D versions. The U-Net 8 model adopts a symmetrical U-shaped structure and achieves good results in many medical segmentation tasks, but its skip connections force the encoder and decoder to perform feature fusion only at the same depth; neglecting the fusion of semantic information at different scales results in the loss of certain details. UNet++ 10 optimizes the structure of the UNet model: it alleviates the unknown network depth with an efficient ensemble of U-Nets of different depths, redesigns skip connections to aggregate features of different semantic scales at the decoder sub-networks, and designs a pruning scheme to accelerate inference. 10 UNet3+ 11 utilizes full-scale skip connections and deep supervision. DeepLab V1 12 uses VGG16 13 as the backbone and employs atrous convolution and conditional random fields to improve classification accuracy. DeepLab V2 14 uses ResNet-101 15 as the backbone, applies atrous convolution more flexibly, and proposes atrous spatial pyramid pooling. DeepLab V3 16 abandons conditional random fields, improves atrous spatial pyramid pooling, and uses atrous convolution to deepen the network.
Aiming at the difficulties of lumbar vertebrae segmentation in CT images, we propose a two-stage automatic lumbar vertebrae segmentation method based on deep learning. The proposed method consists of two stages: lumbar vertebra localization and lumbar vertebrae segmentation. First, we propose a UNet-based lumbar spine localization network, which directly locates the lumbar spine in the image. Then, this paper proposes a three-dimensional XUnet lumbar vertebrae segmentation method to achieve automatic lumbar vertebrae segmentation. Different from existing methods, a novel low-parameter Inception block is introduced to capture rich feature information in the encoder, and multi-scale features are fused in the decoder to take the surrounding context into account and enable sound segmentation performance. The proposed method was validated on lumbar spine CT images from the public dataset VerSe 2020 and our hospital dataset. 17 Through qualitative comparison and quantitative analysis, the experimental results show that the proposed method achieves good lumbar vertebrae segmentation performance and can be further applied to the detection of spinal anomalies and surgical treatment.

RELATED WORKS
In recent years, various machine learning algorithms have been widely used in the field of image segmentation. Jimenez-Pastor et al. 18 introduced a two-stage decision forest and morphological image processing technique to automatically detect and identify vertebral bodies in arbitrary field-of-view CT scans. Combining a convolutional neural network with a deformation model, Korez et al. 19 learn spine features and output a probability map of the spine through the convolutional neural network, which guides the deformation model to generate the spine boundary and finally realize spine segmentation.
Benefiting from the improvement of computational resources and the increase of datasets available for training, deep learning is widely used in image segmentation. Model-based spine segmentation methods are gradually being replaced by data-driven deep learning methods as public spine image datasets become more sophisticated. Sekuboyina et al. 20 used two convolutional neural networks for spine localization and segmentation, respectively. The input of the localization network is a two-dimensional slice of the spine image, and the ground truth of the localization network is obtained by downsampling the real spine mask; patches are then used to segment the vertebrae one by one. Similarly, Janssens et al. 21 used two three-dimensional FCN 6 networks for spine localization and segmentation. The localization network uses regression to obtain the bounding box of the spine, and the segmentation network directly performs voxel-level multi-classification. Lessmann et al. 22 used only one three-dimensional FCN 6 for spine segmentation. They use an iterative method to segment the vertebrae one by one according to prior rules on vertebral appearance, and determine whether the vertebrae in the current block have been segmented by adding a memory component; in addition, a classification component is dedicated to labeling vertebrae that have already been segmented. However, the above-mentioned block-based spine segmentation methods have difficulty handling the overlap between vertebrae. Payer et al. 23 proposed a multi-stage method for spine segmentation: first using 3D U-Net to roughly localize the spine, then using SpatialConfiguration-Net 24 to perform heat map regression for spine localization and identification, and finally using 3D U-Net to perform binary segmentation on the identified vertebrae. The above deep learning-based spine segmentation research has achieved good segmentation results.

Method overview
Aiming at the difficulties of lumbar vertebrae segmentation in CT images, we propose a novel two-stage lumbar vertebrae segmentation method based on deep learning. The method consists of two stages: lumbar vertebra localization and lumbar vertebrae segmentation. First, we propose a UNet-based lumbar spine localization network, which directly locates the lumbar spine in the image and crops out the image region containing the lumbar spine, so as to improve the performance of subsequent segmentation. Then, we propose a three-dimensional XUnet lumbar vertebrae segmentation method with an encoder-decoder framework to achieve automatic lumbar vertebrae segmentation. At the encoder, a novel low-parameter Inception block is introduced to capture rich feature information. At the decoder, multi-scale features are fused to take the surrounding context into account and enable sound segmentation performance.

3.2.1
UNet-based localization network model
In order to obtain accurate positioning of the lumbar spine, this paper proposes a positioning network model based on UNet, as shown in Figure 1. The network model structure is as follows: suppose the input is X ∈ R^(H*W*D*C), and the features X_i ∈ R^(H_i*W_i*D_i*C_i), i ∈ [1, 4], are obtained after feature extraction by the encoder; the decoding feature Y_i can then be expressed by Equation (1), where downsample denotes downsampling, upsample denotes upsampling, σ is the activation function, and f_3*3*3 denotes a convolution with a kernel of size 3*3*3.
1. According to the method described in Equation (1), multi-scale feature fusion is realized in the decoding process. The specific flow of data in the network is as follows:
Encoder layer 1: the input scale is 96*96*128*1; the feature map X_1 is obtained after two convolutions, and the feature map X_1′ with 64 channels and size 48*48*64 is obtained after one downsampling.
Encoder layer 2: the input is X_1′ with scale 48*48*64; X_2 is obtained after two convolutions, and X_2′ with 64 channels and size 24*24*32 is obtained after one downsampling.
Encoder layer 3: the input is X_2′ with scale 24*24*32; X_3 is obtained after two convolutions, and X_3′ with 64 channels and size 12*12*16 is obtained after one downsampling.
Encoder layer 4: the input is X_3′ with scale 12*12*16; X_4 is obtained after two convolutions, and X_4′ with 64 channels and size 6*6*8 is obtained after one downsampling.
Encoder layer 5: the input is X_4′ with scale 6*6*8; the feature map X_5 with 64 channels and size 6*6*8 is obtained after two convolutions.
Decoder layer 1: the multi-scale feature X_4[16:48] and the deconvolved feature map X_5 are spliced along the channel direction to obtain Y_1; the feature map Y_1′ with 64 channels and size 12*12*16 is then obtained by feature fusion through two convolutions.
Decoder layer 2: the multi-scale feature X_3[16:48] and the upsampled feature map Y_1′ are spliced along the channel direction to obtain Y_2; Y_2′ with 64 channels and size 24*24*32 is then obtained by feature fusion through two convolutions.
Decoder layer 3: the multi-scale feature X_2[16:48] and the upsampled feature map Y_2′ are spliced along the channel direction to obtain Y_3; Y_3′ with 64 channels and size 48*48*64 is then obtained by feature fusion through two convolutions.
Decoder layer 4: the multi-scale feature X_1[16:48] and the upsampled feature map Y_3′ are spliced along the channel direction to obtain Y_4; Y_4′ with 64 channels and size 96*96*128 is then obtained by feature fusion through two convolutions. Finally, a convolution kernel of size 1*1*1 is used to generate the localization heat map.
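The layer-by-layer flow above can be sketched compactly with tf.keras. This is a minimal illustration under our assumptions, not the authors' code: the ReLU activations, the sigmoid on the output, and the use of UpSampling3D in place of deconvolution are assumptions; the 64-channel double convolutions, the channel slice [16:48] of each skip feature, and the final 1*1*1 convolution follow the text.

```python
# Minimal sketch of the UNet-based localization network (illustrative names).
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters=64):
    # two 3*3*3 convolutions, as in each encoder/decoder layer
    for _ in range(2):
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_localization_unet(input_shape=(96, 96, 128, 1)):
    inputs = layers.Input(input_shape)
    skips, x = [], inputs
    # encoder layers 1-4: two convolutions, then one downsampling
    for _ in range(4):
        x = conv_block(x)
        skips.append(x)
        x = layers.MaxPool3D(2)(x)
    x = conv_block(x)  # encoder layer 5 (no downsampling)
    # decoder: upsample, splice channels 16:48 of the skip feature, fuse
    for skip in reversed(skips):
        x = layers.UpSampling3D(2)(x)  # the paper mentions deconvolution
        x = layers.Concatenate()([skip[..., 16:48], x])
        x = conv_block(x)
    # 1*1*1 convolution produces the localization heat map
    heatmap = layers.Conv3D(1, 1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, heatmap)

model = build_localization_unet()
print(model.output_shape)  # (None, 96, 96, 128, 1)
```

The heat map output has the same spatial size as the input, so the predicted lumbar spine location maps directly back onto the original CT volume.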
Through the method proposed in this paper, the precise positioning of the lumbar vertebrae can be obtained, which assists the subsequent accurate segmentation of the lumbar vertebrae.

Image cropping
After the lumbar vertebrae have been localized, a bounding box containing them can be obtained, and the image containing the lumbar vertebrae can be cropped. Computing the bounding box requires, in each dimension, the first and last indices at which the HU value exceeds a certain threshold; this pair of indices constitutes the range of the bounding box in the corresponding dimension. After obtaining the three-dimensional range, the image can be cropped directly. However, the model prediction may contain multiple concentrated bright areas, in which case the range obtained by the above algorithm would be too large and the cropped image would contain many black areas; therefore, it is also necessary to determine the largest bright region. In addition, to ensure that the cropped image completely contains the lumbar spine, the computed bounding box must be expanded; for example, each dimension is expanded by 0.1 times the corresponding size of the image. The image is then cropped according to the resulting bounding box. Concretely, this scheme uses a four-neighborhood search to obtain the candidate lumbar vertebra region, selects the voxels whose HU value is greater than 0.3 times the maximum HU value of the image as possible lumbar vertebrae, and finally expands the obtained bounding box in each dimension by 0.1 times the size of the corresponding image. Figure 2 shows the cropping result of the lumbar vertebrae image for two samples, where the lumbar vertebrae region is marked by the red rectangle. It can be seen that a lot of redundant information is removed after cropping.
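The thresholding-and-expansion steps above can be sketched with NumPy alone. The 0.3x threshold ratio and 0.1x expansion factor follow the text; the selection of the largest connected bright region is omitted here for brevity (scipy.ndimage.label could provide it), and the function name is illustrative.

```python
# Sketch of the bounding-box cropping scheme (NumPy only).
import numpy as np

def crop_lumbar_region(volume, thresh_ratio=0.3, expand_ratio=0.1):
    # voxels brighter than 0.3 * max HU are treated as candidate lumbar voxels
    mask = volume > thresh_ratio * volume.max()
    coords = np.argwhere(mask)
    # first/last indices of candidate voxels in each dimension
    lo, hi = coords.min(axis=0), coords.max(axis=0) + 1
    # expand each dimension by 0.1x the corresponding image size
    margin = (expand_ratio * np.array(volume.shape)).astype(int)
    lo = np.maximum(lo - margin, 0)
    hi = np.minimum(hi + margin, volume.shape)
    slices = tuple(slice(a, b) for a, b in zip(lo, hi))
    return volume[slices], (lo, hi)

# toy example: a bright cube inside a dark volume
vol = np.zeros((50, 50, 50))
vol[20:30, 15:25, 10:40] = 1000.0
cropped, (lo, hi) = crop_lumbar_region(vol)
print(cropped.shape)  # (20, 20, 40)
```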

A lumbar spine segmentation model based on 3D multi-scale XUNet
In order to accurately segment the lumbar spine, we propose a 3D multi-scale XUNet-based spine segmentation method (Figure 3). XUnet uses the Inception block for feature extraction. The Inception block includes two feature extraction layers. The first layer includes three convolution blocks (denoted convblock1-1, convblock1-2, and convblock1-3) and a max pooling, where convblock1-1 consists of 64 convolution kernels of size 1*1*1, convblock1-2 consists of 32 convolution kernels of size 1*1*1, and convblock1-3 consists of 16 convolution kernels of size 3*3*3. The second layer includes three convolution blocks (denoted convblock2-1, convblock2-2, and convblock2-3), where convblock2-1 consists of 128 convolution kernels of size 3*3*3, convblock2-2 consists of 32 convolution kernels of size 5*5*5, and convblock2-3 consists of 32 convolution kernels of size 1*1*1. The feature extraction results from convblock1-1, convblock2-1, convblock2-2, and convblock2-3 are concatenated and fused as the final output of the Inception block, which enables the aggregation of features at different semantic scales. At the same time, using the Inception block in place of the plain convolution operation further reduces redundant parameters in the network while increasing its width and depth and improving its expressive ability.
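The four-branch structure described above can be sketched in tf.keras as follows. The wiring of the first-layer blocks into the second-layer blocks is our reading of the text (the standard Inception layout) and the ReLU activations are assumptions; the filter counts and kernel sizes are as stated, and the concatenated branch outputs give the 256 channels expected by the encoder.

```python
# Sketch of the described Inception block (wiring is our assumption).
import tensorflow as tf
from tensorflow.keras import layers

def conv3d(filters, size):
    return layers.Conv3D(filters, size, padding="same", activation="relu")

def inception_block(x):
    # branch 1: convblock1-1, 64 kernels of 1*1*1 (goes straight to the output)
    b1 = conv3d(64, 1)(x)
    # branch 2: convblock1-2 (32 kernels, 1*1*1) -> convblock2-1 (128 kernels, 3*3*3)
    b2 = conv3d(128, 3)(conv3d(32, 1)(x))
    # branch 3: convblock1-3 (16 kernels, 3*3*3) -> convblock2-2 (32 kernels, 5*5*5)
    b3 = conv3d(32, 5)(conv3d(16, 3)(x))
    # branch 4: max pooling -> convblock2-3 (32 kernels, 1*1*1)
    b4 = conv3d(32, 1)(layers.MaxPool3D(pool_size=3, strides=1, padding="same")(x))
    # concatenate the four branch outputs: 64 + 128 + 32 + 32 = 256 channels
    return layers.Concatenate()([b1, b2, b3, b4])

inp = layers.Input((8, 8, 16, 32))
out = inception_block(inp)
print(out.shape[-1])  # 256
```

Because the expensive 3*3*3 and 5*5*5 convolutions operate on channel-reduced inputs, the block is cheaper than a single plain convolution with the same output width.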
The specific structure of the network is as follows:
1. Suppose the input is X ∈ R^(H*W*D*C), and the features X_i ∈ R^(H_i*W_i*D_i*C_i), i ∈ [1, 4], are obtained after feature extraction by the encoder; the decoding feature Y_i can then be expressed in the same form as Equation (1), where downsample denotes downsampling, upsample denotes upsampling, σ is the activation function, and f_3*3*3 denotes a convolution with a kernel of size 3*3*3.

FIGURE 2 Cropping result of the lumbar vertebrae image for two samples, where the lumbar vertebrae region is in the rectangle with red line.

FIGURE 3 Schematic diagram of 3D multi-scale XUNet network.
1. Multi-scale feature fusion is realized in the decoding process, and the specific flow of data in the network is as follows:
Encoder layer 1: the input scale is 64*64*128*1; the feature map X_1 is obtained after two convolutions, and the feature map X_1′ with 32 channels and size 32*32*64 is obtained after one downsampling.
Encoder layer 2: the input is X_1′ with scale 32*32*64; X_2 is obtained after two convolutions, and X_2′ with 64 channels and size 16*16*32 is obtained after one downsampling.
Encoder layer 3: the input is X_2′ with scale 16*16*32; X_3 is obtained after two convolutions, and X_3′ with 128 channels and size 8*8*16 is obtained after one downsampling.
Encoder layer 4: the input is X_3′ with scale 8*8*16; after one convolution and one Inception block, the feature map X_4 with 256 channels and size 4*4*8 is obtained.
Decoder layer 1: the multi-scale feature maps X_2, X_3 and the deconvolved feature map X_4 are spliced along the channel direction to obtain Y_1; the feature map Y_1′ with 128 channels and size 16*16*32 is then obtained by feature fusion through two convolutions.
Decoder layer 2: the multi-scale feature maps X_1, X_2, X_3 and the upsampled feature map Y_1′ are spliced along the channel direction to obtain Y_2; Y_2′ with 64 channels and size 32*32*64 is then obtained by feature fusion through two convolutions.
Decoder layer 3: the feature maps X_1, X_2 and the upsampled feature map Y_2′ are spliced along the channel direction to obtain Y_3; Y_3′ with 32 channels and size 64*64*128 is then obtained by feature fusion through two convolutions. Finally, the Softmax function is applied to complete the segmentation and return the segmentation result map.
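To splice skip features from different encoder depths, they must first be brought to the decoder's resolution. A minimal tf.keras sketch for decoder layer 1 (target scale 16*16*32) is shown below; the use of MaxPool3D/UpSampling3D for resampling (the text mentions deconvolution for X_4) and the ReLU activations are our assumptions, while the shapes and channel counts follow the walkthrough above.

```python
# Sketch of multi-scale splicing for decoder layer 1 of XUnet.
import tensorflow as tf
from tensorflow.keras import layers

x2 = layers.Input((32, 32, 64, 64))    # X_2: 64 channels at 32*32*64
x3 = layers.Input((16, 16, 32, 128))   # X_3: 128 channels at 16*16*32
x4 = layers.Input((4, 4, 8, 256))      # X_4: 256 channels at 4*4*8

x2_down = layers.MaxPool3D(2)(x2)                # 32*32*64 -> 16*16*32
x4_up = layers.UpSampling3D(4)(x4)               # 4*4*8 -> 16*16*32
y1 = layers.Concatenate()([x2_down, x3, x4_up])  # 64 + 128 + 256 = 448 channels
# two convolutions fuse the spliced features into Y_1' (128 channels)
y1 = layers.Conv3D(128, 3, padding="same", activation="relu")(y1)
y1 = layers.Conv3D(128, 3, padding="same", activation="relu")(y1)
print(y1.shape)  # (None, 16, 16, 32, 128)
```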
Considering that too many model parameters bring unnecessary computational overhead, the Inception block described above is used in place of the original convolution, which reduces the computational cost while increasing the depth and width of the network and improving the expressiveness of the model.

Experimental data
The VerSe 2020 public dataset and our own hospital dataset are used in this paper. Our study focuses specifically on the segmentation of the lumbar vertebrae, so we selected the CT data that cover the lumbar vertebrae. For the VerSe 2020 dataset, after careful selection we obtained a total of 156 CT samples containing the lumbar vertebrae; the experiments on this dataset were performed with five-fold cross-validation. Our own hospital dataset contains 500 CT samples in total, all provided with ground truth segmentations from radiologists; 400 samples are used for training and the other 100 for testing. Figure 4 shows three different sections (horizontal, sagittal, and coronal) of a sample 3D CT image from the dataset. Data preprocessing mainly includes generating spine heat maps and windowing the images. Figure 5 shows an example of the heat maps, and Figure 6 shows the images before and after windowing. The lumbar spine heat map is obtained by Gaussian smoothing of the lumbar spine mask; it serves as the target of the subsequent localization network and roughly represents the location of the lumbar spine. According to the literature, the window level and window width for lumbar skeletal tissue are 400 and 1800, respectively, which better present the lumbar vertebrae to the network.
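The two preprocessing steps above can be sketched as follows: windowing with the level/width values from the literature (400/1800), and a Gaussian-smoothed heat map built from the lumbar spine mask. The sigma value and the normalization to [0, 1] are illustrative assumptions.

```python
# Sketch of the preprocessing: windowing and heat-map generation.
import numpy as np
from scipy.ndimage import gaussian_filter

def apply_window(ct, level=400.0, width=1800.0):
    # clip HU values to [level - width/2, level + width/2], rescale to [0, 1]
    lo, hi = level - width / 2, level + width / 2
    return (np.clip(ct, lo, hi) - lo) / (hi - lo)

def spine_heatmap(mask, sigma=4.0):
    # Gaussian smoothing of the binary lumbar spine mask yields a soft
    # heat map roughly encoding the location of the lumbar spine
    hm = gaussian_filter(mask.astype(np.float32), sigma)
    return hm / (hm.max() + 1e-8)  # normalize to [0, 1]

ct = np.random.uniform(-1000, 2000, size=(32, 32, 32))
windowed = apply_window(ct)

mask = np.zeros((32, 32, 32))
mask[10:20, 10:20, 10:20] = 1.0
hm = spine_heatmap(mask)
print(windowed.min() >= 0.0, windowed.max() <= 1.0, hm.max() <= 1.0)
```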

Experimental environment
This work uses the Python programming language together with the deep learning framework TensorFlow and standard image processing libraries. The code was developed under Windows using PyCharm as the integrated development environment. Software environment: Windows 10, PyCharm 2020.3 x64, TensorFlow 1.16. Hardware: an NVIDIA GTX 1080 Ti GPU with 11 GB of video memory, 64 GB of RAM, and a 2 TB hard disk.

Evaluation metrics
The performance is evaluated using several commonly utilized evaluation metrics, including the dice similarity coefficient (DSC), false negative dice (FND), false positive dice (FPD), maximum symmetric surface distance (MSD), and average symmetric surface distance (ASD). Let S denote the set of segmented voxels and G the set of ground truth voxels. DSC measures the overlap between segmentation and ground truth, while FND and FPD measure the rate of under-segmentation and over-segmentation, respectively:
DSC = 2|S ∩ G| / (|S| + |G|),
FND = 2|G \ S| / (|S| + |G|),
FPD = 2|S \ G| / (|S| + |G|).
DSC, FND, and FPD values range from 0 to 1. The metrics MSD and ASD quantify the distance between the segmentation contour surface and the ground truth contour surface. With S_c and G_c denoting the contour surfaces of the segmentation result and the ground truth, and d(i, j) the distance between contour points i and j, MSD and ASD are defined as:
MSD = max{ max_{i ∈ S_c} min_{j ∈ G_c} d(i, j), max_{j ∈ G_c} min_{i ∈ S_c} d(i, j) },
ASD = ( Σ_{i ∈ S_c} min_{j ∈ G_c} d(i, j) + Σ_{j ∈ G_c} min_{i ∈ S_c} d(i, j) ) / (|S_c| + |G_c|).
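The three overlap metrics can be computed directly from binary masks, as sketched below under the common definitions (the surface-distance metrics MSD and ASD require contour extraction and are omitted here; the function name is illustrative).

```python
# Sketch of the overlap metrics DSC, FND, and FPD for binary masks.
import numpy as np

def overlap_metrics(seg, gt):
    seg, gt = seg.astype(bool), gt.astype(bool)
    denom = seg.sum() + gt.sum()
    dsc = 2.0 * np.logical_and(seg, gt).sum() / denom   # overlap
    fnd = 2.0 * np.logical_and(~seg, gt).sum() / denom  # under-segmentation
    fpd = 2.0 * np.logical_and(seg, ~gt).sum() / denom  # over-segmentation
    return dsc, fnd, fpd

# toy example: the segmentation misses one row of the ground truth square
gt = np.zeros((10, 10), bool); gt[2:8, 2:8] = True   # 36 voxels
seg = np.zeros((10, 10), bool); seg[3:8, 2:8] = True  # 30 voxels
dsc, fnd, fpd = overlap_metrics(seg, gt)
print(round(dsc, 3), round(fnd, 3), round(fpd, 3))  # 0.909 0.182 0.0
```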

Experimental results
To verify the performance of our method, the proposed method is compared with the classic UNet method, the iterative fully convolutional network method (IFCN), 22 and the nnUnet method. 25 These methods are evaluated on the same dataset as our method, with the same preprocessing. We further verified the performance of the proposed method through quantitative analysis on the VerSe dataset and our hospital dataset. The quantitative results of the different segmentation methods are shown in Tables 1 and 2. The method proposed in our study achieves better segmentation results, and the quantitative analysis is consistent with the visual evaluation. The performance on the hospital dataset is better than on the VerSe dataset; the reason is that the hospital dataset contains more training data.
The positioning part and the Inception part are further evaluated with ablation experiments. Table 3 shows the quantitative analysis of the ablation experiments. Removing either part slightly decreases the performance of the proposed method, which confirms the contribution of the developed modules.