A survey on artificial intelligence in pulmonary imaging

Over the last decade, deep learning (DL) has contributed to a paradigm shift in computer vision and image recognition, creating widespread opportunities for using artificial intelligence in research as well as industrial applications. DL has been extensively studied in medical imaging applications, including those related to pulmonary diseases. Chronic obstructive pulmonary disease, asthma, lung cancer, pneumonia, and, more recently, COVID-19 are common lung diseases affecting nearly 7.4% of the world population. Pulmonary imaging has been widely investigated toward improving our understanding of disease etiologies, early diagnosis, and assessment of disease progression and clinical outcomes. DL has been broadly applied to solve various pulmonary image processing challenges including classification, recognition, registration, and segmentation. This article presents a survey of pulmonary diseases, roles of imaging in translational and clinical pulmonary research, and applications of different DL architectures and methods in pulmonary imaging, with emphasis on DL-based segmentation of major pulmonary anatomies such as lung volumes, lung lobes, pulmonary vessels, and airways as well as thoracic musculoskeletal anatomies related to pulmonary diseases.

A breakthrough moment for AI-steered computer vision and image recognition may be attributed to the development of AlexNet (Krizhevsky et al., 2012), when a convolutional neural network (CNN) architecture was applied to solve a real-life, large-scale recognition challenge (Deng et al., 2009; Russakovsky et al., 2015) and surpassed all competing approaches by a wide margin. Since the astonishing performance of AlexNet, the triumph of deep learning (DL) and AI has continued, and the winners of computer vision and medical imaging challenges are almost always DL methods. A unique feature of CNN, as compared to the traditional multi-layered perceptron, is the introduction of convolution layers combined with pooling, which represent multiscale convolution filters and deliver a powerful tool of data-driven multiscale feature optimization to solve complex challenges in computer vision and image recognition. The history of visual pattern recognition using NNs goes back to 1979, when Fukushima presented the "neocognitron" and demonstrated its ability to recognize stimulus patterns unaffected by shifts in position or small distortions in shape (Fukushima & Miyake, 1982). In a widely cited pioneering work, LeCun et al. (1989) introduced convolution kernels in an NN architecture and applied backpropagation to learn kernel weights, establishing the idea of data-driven spatial feature optimization. Also, the authors demonstrated that their network was able to recognize U.S. postal service-provided handwritten zip code digits (LeCun et al., 1989). In 1998, LeCun et al. presented another landmark work (LeCun et al., 1998) that introduced pooling between layers to simulate multi-scale features and developed a multi-layered CNN, which was successfully applied to recognize handwritten digits. In 2006, Hinton et al. (2006) made a significant contribution, coined the idea of DL, and introduced deep belief networks with multiple levels of feature representation at different levels of abstraction, which was the state-of-the-art approach for image recognition prior to AlexNet. Open access software libraries and platforms, such as TensorFlow (Abadi et al., 2016), Keras (Géron, 2022), convolutional architecture for fast feature embedding (Caffe; Jia et al., 2014), PyTorch (Paszke et al., 2019), Theano (Al-Rfou et al., 2016), and so forth, are available for efficient implementation and application of DL, which may also be credited for the rapid and widespread growth of DL in various computer vision and imaging applications (Deng & Yu, 2014; Goodfellow et al., 2016; LeCun et al., 2015; Litjens et al., 2017).
Recently, Vaswani et al. (2017) introduced a fundamentally novel NN architecture, referred to as the Transformer, that learns intrinsic features and global contextual information of sequential data, which has drawn significant research attention and established a new trend in developmental AI. Transformers were first applied to natural language processing (NLP) and use a technique known as attention or self-attention (Vaswani et al., 2017) that learns and characterizes the influence and mutual dependence of sequential data. Compared to RNN models, the Transformer processes the entire input sequence at once, which enables high parallelization, allowing large training datasets and robust knowledge acquisition. This powerful notion of parallel learning from large sequential data has led to several groundbreaking developments including Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2018) and the Generative Pre-trained Transformer (GPT; OpenAI, 2023), which were trained with large language datasets and fine-tuned using task-specific smaller datasets. Subsequently, significant research efforts have been devoted toward integrating attention mechanisms (Fedus et al., 2022; Vaswani et al., 2017) into mainstream modern AI using CNN frameworks (Bello et al., 2019; Dosovitskiy et al., 2020; Ramachandran et al., 2019; Vaswani et al., 2021; X. Wang, Girshick, et al., 2018; Yin et al., 2020). CNN-based models learn local convolutional kernels, delivering the notion of data-driven feature optimization, which are aggregated and propagated through different convolution layers, bringing the notion of multi-scale feature optimization. On the other hand, Transformer architectures offer the ability to learn data dependencies over the entire image space (Raghu et al., 2021). Others have used Transformer frameworks to learn and encode the effectiveness of features and their long-range dependencies to solve a target AI task. In a landmark work, referred to as Vision Transformers (ViTs; Dosovitskiy et al., 2020), Transformer modules have been used to fully replace convolution kernels in deep NNs and operate on a sequence of image patches. Transformer-based models, either in a pure form or hybridized with CNNs, have been increasingly applied to computer vision as well as medical imaging tasks, which is the basis of the current survey. A major challenge with organizing this survey is that several imaging modalities have been applied to pulmonary research to evaluate various physiologic aspects of pulmonary anatomies in different diseases. Therefore, this survey paper is laid out in three major modules: (1) review of research problems and applications in pulmonary imaging, (2) description of different DL methods and architectures and their salient strengths and areas of applications, and (3) task-specific review of roles and applications of AI in pulmonary imaging. Specifically, this article starts with a short review of different pulmonary diseases and the roles of imaging to facilitate the understanding of etiologies, diagnosis, and assessment of disease severity, progression, and clinical outcomes in Section 2. Section 3 reviews major variants of DL methods and network architectures, describes their inherent strengths, and discusses their roles and applications in pulmonary imaging research. Section 4 presents a task-specific review of the roles and applications of AI for the assessment of different thoracic anatomies related to pulmonary diseases. Finally, a short conclusion is drawn in Section 5.
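The scaled dot-product self-attention at the heart of the Transformer discussed above can be sketched in a few lines. This is a minimal NumPy-only illustration, not any particular published implementation; the shapes, seed, and projection matrices are illustrative stand-ins.

```python
# Minimal sketch of scaled dot-product self-attention (after Vaswani et al., 2017).
# NumPy only; all names and dimensions here are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (n_tokens, d_model); Wq/Wk/Wv: (d_model, d_k) learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every token attends to every other token in one matrix product --
    # the source of the parallelism contrasted with RNNs above.
    scores = softmax(Q @ K.T / np.sqrt(d_k))   # (n_tokens, n_tokens)
    return scores @ V                          # (n_tokens, d_k)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                # 4 tokens (e.g., image patches)
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

In a ViT, the "tokens" are flattened image patches, and many such attention layers are stacked with learned multi-head projections.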

| PULMONARY IMAGING AND LUNG DISEASES
Medical imaging is a means to visualize and assess both anatomy and function of inner organs and tissues in the human body. Radiography may be considered the first medical imaging modality, originating with the discovery of the x-ray by Röntgen in 1895 (Röntgen, 1896). Three-dimensional (3D) medical imaging emerged in the form of CT (Hounsfield, 1973), which has revolutionized medical imaging and has been described as the greatest advancement in radiology since the x-ray (Kasban et al., 2015). MR imaging (MRI) was invented in the early 1970s, and the first MRI prototypes were tested in 1980 (Lauterbur, 1973). Ultrasound imaging emerged prior to CT and MRI, and it has been widely used in clinical settings as well as in research studies (Szabo, 2004). Nuclear imaging, for example, single-photon emission CT (SPECT; Kuhl & Edwards, 1963) and positron emission tomography (PET; Sweet & Brownell, 1955), records radiation emitted from within the body, and its primary emphasis is on imaging function. Often, a nuclear scanner is coupled with a CT or MR imaging machine to collect co-registered functional and anatomic imaging (Bar-Shalom et al., 2000; Hasegawa et al., 1990; Lang et al., 1992). Over the last several decades, researchers have extensively investigated different medical imaging modalities, exploring their roles for different diagnostic purposes and for understanding the etiology, pathogenesis, progression, and health impacts of different diseases (Bercovich & Javitt, 2018; Elangovan & Jeyaseelan, 2016).
Lung disease refers to several types of respiratory system disorders that impair the ability to breathe and reduce the efficiency of pulmonary function in terms of gas exchange (Decramer et al., 2008; Marcus et al., 2000; Nunes et al., 2017). Lower respiratory diseases, such as chronic obstructive pulmonary disease (COPD; Celli, MacNee, & Force, 2004) and asthma (Holgate, 2008), as well as other lung diseases such as lung cancer (Alberg & Samet, 2003), pneumonia (Hoare & Lim, 2006), pleural effusion (Light, 2002), pulmonary fibrosis (Gross & Hunninghake, 2001), and, more recently, COVID-19 (Li et al., 2020) are leading causes of death in the United States (Ahmad & Anderson, 2021; Quaderi & Hurst, 2018). In 2017, 544.9 million people worldwide had a chronic respiratory disease, representing a 39.8% increase compared with 1990 (Soriano et al., 2020). For most lung diseases, early diagnosis is crucial for the therapeutic outcome (Gassert et al., 2021). Radiographic intensity and opacities as well as patterns of opacities in chest x-ray and CT imaging are popularly used to characterize and diagnose many lung diseases.
Chest imaging is used to examine different anatomies in the thorax, which include the heart and major blood vessels, the respiratory system (lungs, airways, chest wall, diaphragm) as well as a number of important bones (thoracic spine, ribs, sternum) and muscles (pectoral muscles and erector spinae; Morris et al., 2004; Raju et al., 2017; Ritman et al., 1980). Different chest imaging modalities, including x-rays, CT, MR, and PET, have been adopted in research and clinical studies related to cardiac and lung diseases (Bild et al., 2002; Hoffman et al., 2009; Sieren et al., 2016). Hyperpolarized 129Xe gas MR imaging has been used to visualize lung anatomy and physiology that are difficult to image with conventional MR (Mugler et al., 2010; Oros & Shah, 2004). Although MR and PET imaging have been applied to lung disease-related studies (Jaffer et al., 2002; Stender et al., 2014), chest x-ray and CT are the two most common modalities for pulmonary imaging (Donohue et al., 2012; Regan et al., 2010; Ru Zhao et al., 2011; Sieren et al., 2016). Over decades, chest x-ray has remained the clinical choice for initial diagnosis of lung abnormalities since it is quick, effective, and relatively cheap with low radiation exposure (Damilakis et al., 2010). Often, chest CT is used as the follow-up screening tool after the identification of lung abnormalities in x-rays, which allows detailed assessment of lung parenchyma for emphysema and air-trapping (Gevenois et al., 1996; Herth et al., 2018; see Figure 1). Also, chest CT scans are used to compute parametric response mapping (PRM; Galbán et al., 2009; Pompe et al., 2015, 2017), to assess lung parenchyma for pulmonary fibrosis (Gross & Hunninghake, 2001), and to evaluate airways for wall thickening (Berger et al., 2005), mucus plugging (Rogers, 2004), bronchiectasis (McGuinness & Naidich, 2002), and so forth. It has been observed in nationwide COPD studies that CT-based metrics correlate with disease severity, progression, and clinical outcomes. For example, CT-based airway measures of wall thickness, lumen cross-sectional area, and detected airway branch counts are significantly associated with COPD severity (Berger et al., 2005; Charbonnier et al., 2019; Donohue et al., 2012; King et al., 2000; Kirby et al., 2018, 2020; Niimi et al., 2003). Chest CT-derived pectoral muscle area (PMA) was shown to be associated with increased 5-year mortality in current smokers without airflow obstruction and was more significantly related to spirometry measures, dyspnea, and 6-minute-walk distance compared with body mass index (BMI; Diaz et al., 2018; McDonald et al., 2014).
Several large imaging-based longitudinal studies (Armato et al., 2011; Bourbeau et al., 2014; Jarjour et al., 2012; Regan et al., 2010; Ru Zhao et al., 2011; Smith et al., 2014; Vestbo et al., 2008) have been designed and established to investigate the roles of pulmonary imaging-based markers in lung disease with focus on disease etiology, progression, and clinical outcomes. Specifically, the Genetic Epidemiology of COPD (COPDGene; Regan et al., 2010) study is a longitudinal multi-center study with over 10,000 participants aimed to investigate the underlying genetic factors associated with COPD. Chest CT scans at inspiration and expiration for smokers with varying degrees of COPD have been collected at each visit of this longitudinal study, and the roles of image-based features of the lungs, airways, and thoracic musculoskeletal anatomy have been investigated (Bhatt et al., 2019). The Multi-Ethnic Study of Atherosclerosis (MESA) investigates the prevalence, correlates, and progression of subclinical cardiovascular disease in over 6500 participants. Cardiac CT scans were longitudinally collected in this study to measure coronary calcium, and, in later phases of the study, additional full lung CT scans were collected to investigate the relationships between lung features and atherosclerosis (Bild et al., 2002; Donohue et al., 2012; Hoffman et al., 2009). The Severe Asthma Research Program is another longitudinal study with over 700 adult and child participants designed to explore the etiology and progression of severe asthma with focus on improving clinical treatment of the disease (Jarjour et al., 2012). Chest CT-based features of the lung and airway are used in this study to explain the development and progression of severe asthma in patients and investigate different mechanisms contributing to different asthma types. Chest x-rays are traditionally used for lung cancer screening (Fontana et al., 1986). Modern lung cancer screening studies have taken advantage of the increased sensitivity and detail of chest CT imaging and used it to detect and classify pulmonary nodules (Ardila et al., 2019; McMahon et al., 2008; Menezes et al., 2010; Wender et al., 2013; Zhao et al., 2011). Recently, several studies have used chest x-ray and CT imaging to study the impacts of COVID-19 on the lung parenchyma such as ground-glass opacities, vascular enlargement, and traction bronchiectasis (Hani et al., 2020; Y. Wang, Dong, et al., 2020; Zhao et al., 2020).

FIGURE 1 Illustrations of emphysema and air trapping on chest CT images. (a) Emphysema in a male participant from the Genetic Epidemiology of Chronic Obstructive Pulmonary Disease (COPD) study with preserved lung function. A coronal CT image slice is shown with and without markings of emphysema. (b) Same as (a) but for a male participant with severe COPD. (c,d) Same as (a,b) but for air trapping. Emphysematous regions were computed as regions with CT intensity below −950 HU in inspiratory scans, while air trapping was found in expiratory scans by thresholding the lung parenchyma at −856 HU (Herth et al., 2018). CT image settings: level = −200 HU, window = 2000 HU.
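The density-threshold metrics illustrated in Figure 1 (voxels below −950 HU at inspiration for emphysema, below −856 HU at expiration for air trapping) reduce to a simple masked fraction. The sketch below uses a synthetic scan and an all-true stand-in lung mask; in practice the mask would come from a lung segmentation.

```python
# Sketch of the low-attenuation-area metrics behind Figure 1: percent emphysema
# (< -950 HU, inspiration) and percent air trapping (< -856 HU, expiration).
# The CT volume and lung mask here are synthetic stand-ins, not study data.
import numpy as np

def low_attenuation_fraction(ct_hu, lung_mask, threshold_hu):
    """Fraction of lung voxels with attenuation below the given HU threshold."""
    lung_voxels = ct_hu[lung_mask]
    return float((lung_voxels < threshold_hu).mean())

rng = np.random.default_rng(1)
insp_ct = rng.normal(-870, 60, size=(32, 32, 32))   # synthetic inspiratory scan
lung_mask = np.ones(insp_ct.shape, dtype=bool)      # stand-in lung segmentation

pct_emphysema = 100 * low_attenuation_fraction(insp_ct, lung_mask, -950)
print(f"% emphysema (LAA-950): {pct_emphysema:.1f}")
```

The same function applied to an expiratory scan with a −856 HU threshold would give the air-trapping percentage.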
The amount of expert time required, inter-expert bias, and data precision are key difficulties in utilizing the vast and diverse image data from large longitudinal studies collected at different sites using different scanners and imaging setups. Automated methods characterizing and quantifying effective features are the key to taking maximum advantage of such studies, and establishing such methods has proven to be a persistent bottleneck due to the challenges emerging from the complexity of the tasks and the diversity of image data. Over decades, traditional image processing and ML methods have gradually advanced and become established to meet the needs of these large studies (Tschirren, Hoffman, et al., 2005; Tschirren, McLennan, et al., 2005). But there is ample need and opportunity for further developmental research improving the automation, accuracy, and efficiency of image computational methods and introducing novel and effective image markers. Recent advancements in AI and emerging DL methods have created an ocean of new opportunities and tools to address the challenges related to automation of image-based measures. The unique strength of DL methods is their ability to extract optimized multi-scale features from unstructured data, such as images, introducing the notion of efficient data-driven learning of complex features and performing more computationally intensive tasks. Benefits of DL are further compounded in large studies due to the known positive association between method performance and the size of training datasets (Sun et al., 2017).

| DEEP LEARNING IN PULMONARY IMAGING
DL has been widely used in various medical imaging applications including anatomic object segmentation, classification, and detection as well as image enhancement and registration (Deng & Yu, 2014; Litjens et al., 2017; Shen et al., 2017). Object segmentation sets the foundation toward quantifying image-based semantic knowledge, which has been investigated over decades and yet has remained a primary challenge in many research applications. Over the last decade, DL has been popularly applied to address segmentation challenges in two-dimensional (2D; Ronneberger et al., 2015), 3D (Çiçek et al., 2016), and time-series (Isensee et al., 2017) medical imaging applications (Gerard et al., 2019; Hooda et al., 2019; Nadeem et al., 2021). Early applications of DL in medical imaging were related to computer-aided diagnosis (CAD), where DL was applied for data-driven selection of optimum multi-scale features for detection and classification of diseases or abnormalities (Kumar et al., 2015; Ozturk et al., 2020; Song et al., 2017). Previously, classification features were manually selected for traditional CAD systems (Giger et al., 2008), which requires expert domain knowledge and is constrained by individual preference bias and a lack of multi-scale and comprehensive feature representation. In pulmonary imaging, DL has been popularly applied in nodule classification and cancer detection from chest CT scans (Hamidian et al., 2017; Kumar et al., 2015; Song et al., 2017). Lately, DL research in multimodal medical image registration has gained steam, where the roles of DL have been explored from different perspectives including the conception of deep similarity metrics and supervised and unsupervised transformation estimation (Eppenhof & Pluim, 2018b; Fu, Lei, Wang, Curran, et al., 2020; Haskins et al., 2020). In pulmonary imaging applications, DL has been applied for registration of multi-volume time-series as well as longitudinal chest CT scans (Fu, Lei, Wang, Higgins, et al., 2020; Lei et al., 2020). Recently, the introduction of generative DL networks has strengthened medical imaging research avenues related to enhancement of image quality and resolution and harmonization of multi-site and multi-scanner image data (Wang, Chen, & Hoi, 2020; Yang et al., 2019; You et al., 2020).

| Convolutional neural networks
The convolutional and pooling layers in CNNs were inspired by classic notions of modeling complex cells sensitive to moving stimuli as linear combinations of simple cells (Hubel & Wiesel, 1959). Specifically, CNNs introduced spatial adherence or grid patterns of input data in NNs, which brought about a new paradigm in DL. The notion of spatial adherence in a hierarchical multi-layer CNN offers a novel and unique mechanism of data-driven learning of multiscale spatial features in computer vision and image understanding applications (Goodfellow et al., 2016; LeCun et al., 2015). In general, a CNN includes multiple hierarchical levels of convolution and pooling layers and a final layer of fully connected NN (see Figure 2). A convolution layer performs feature extraction followed by an activation layer, such as the Rectified Linear Unit or ReLU (Nair & Hinton, 2010), while the pooling layer downsamples convolution layer outputs for the next hierarchical level at a lower scale. Finally, the high-level features extracted as the output of the last convolution layer are flattened into a vector and passed to a fully connected NN to accomplish the overall classification or recognition objective. A digital image is a 2D (or 3D) array of pixel (or voxel) intensity values. For each convolution layer, multiple small kernel grids representing multiple feature operators are used and convolved with the input image to generate the outputs of the convolution layer. During the training of a CNN, the weights of these kernel grids at each convolution layer are optimized, leading to the notion of feature learning. At a single convolutional layer, multiple convolutional kernels representing different image features, for example, edge features along different directions, are simultaneously optimized. As one layer of convolution and pooling feeds its output into the next layer, extracted features become progressively more complex, for example, from edges in different directions to complex patterns (Goodfellow et al., 2016; LeCun et al., 2015).
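One convolution-ReLU-pooling stage as described above can be sketched from scratch. This NumPy-only sketch uses a fixed hand-crafted edge kernel as an illustrative stand-in for the kernels a CNN would learn during training.

```python
# From-scratch sketch of one convolution -> ReLU -> max-pooling stage of a CNN.
# NumPy only; the Sobel kernel stands in for a learned convolution kernel.
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as in DL convolution layers)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0)          # activation layer

def max_pool(x, size=2):
    """Downsample by taking the max over non-overlapping size x size blocks."""
    h, w = x.shape[0] // size * size, x.shape[1] // size * size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.zeros((8, 8))
image[:, 4:] = 1.0                   # toy image with a vertical edge
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

features = max_pool(relu(conv2d(image, sobel_x)))  # one conv/ReLU/pool stage
print(features.shape)  # (3, 3)
```

Stacking such stages, with the kernel weights optimized by backpropagation instead of hand-picked, yields the hierarchical multi-scale features described above.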
The implementation of a CNN for a specific imaging application has two major steps: (1) network architecture design and (2) network learning. The network architecture design step includes specifications of network hyperparameters including the number of hierarchical levels, the number of convolutional layers at individual levels, the number and size of kernels at individual convolution layers, and the number of neurons in and depth of the fully connected NN layer. During the CNN learning phase, referred to as training, the grid weights of different convolution kernels as well as the weights and biases of the final fully connected NN layer are optimized using error backpropagation and gradient descent (Ruder, 2016). A large volume of training data of the order of thousands to millions of samples, usually with known outputs, is required to train a CNN and, often, data augmentation approaches are adopted to inflate the training data size from an available limited dataset. Frequently, randomized affine or nonlinear transformations, additive noise, intensity and contrast alterations, and so forth are applied for data augmentation (Perez & Wang, 2017; Shorten & Khoshgoftaar, 2019). Also, it is important to identify the appropriate network loss function for training based on application objectives and the expected class distribution in the training dataset (Sudre et al., 2017). Due to rapid advancements in computation technology and the ease of design and implementation of CNNs using available open-source software libraries (Abadi et al., 2016; Géron, 2022; Jia et al., 2014), it has become feasible to iteratively optimize the right combination of network architecture and training features (Nadeem, Comellas, Hoffman, & Saha, 2022; Zhong et al., 2019).
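The randomized augmentations mentioned above can be sketched as a small composable function. This is a NumPy-only illustration; the flip probability, scaling range, and noise level are arbitrary illustrative choices, not values from any study cited here.

```python
# Sketch of simple randomized data augmentation: flip, contrast scaling,
# intensity shift, and additive Gaussian noise. All parameter ranges are
# illustrative assumptions, not values from the surveyed literature.
import numpy as np

def augment(image, rng):
    out = image.copy()
    if rng.random() < 0.5:                       # random horizontal flip
        out = out[:, ::-1]
    out = out * rng.uniform(0.9, 1.1)            # random contrast scaling
    out = out + rng.uniform(-0.05, 0.05)         # random intensity shift
    out = out + rng.normal(0, 0.01, out.shape)   # additive Gaussian noise
    return out

rng = np.random.default_rng(42)
image = np.linspace(0, 1, 64).reshape(8, 8)      # toy 8x8 "image"
batch = [augment(image, rng) for _ in range(4)]  # inflate one sample into four
print(len(batch), batch[0].shape)
```

In practice such transforms are applied on the fly during training, so each epoch sees a slightly different version of every sample.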
The pyramid-like architecture of CNN embeds a natural mechanism of merging multi-scale features to accomplish high-level classification or recognition tasks, as evidenced by the record-breaking recognition performance of CNN on a highly challenging image dataset in 2012 (Krizhevsky et al., 2012) and subsequently further improved with the introduction of residual CNN (He et al., 2016). In medical imaging applications, CNNs have been widely applied to accomplish classification and recognition tasks to facilitate detection of abnormalities and diagnosis of diseases related to different anatomic regions (Litjens et al., 2017). Particularly, in pulmonary imaging research, CNNs have been extensively applied to address classification-related challenges in several pulmonary diseases including lung cancer (nodules and lesions; Hu et al., 2018), tuberculosis (Lakhani & Sundaram, 2017), COPD (González, Ash, et al., 2018), and pneumonia (Rajpurkar et al., 2017) from chest x-rays and CT. Cha et al. (2019) demonstrated the application of CNN to improve the diagnostic accuracy of lung cancer from chest x-rays and showed that CNN exceeds the performance of human experts for T1 lung cancer detection. Heo et al. (2019) used CNN to extract multi-scale deep image features and combined those with participants' demographic data to improve the performance of tuberculosis detection. A number of research groups (Dou et al., 2017; Gu et al., 2018; Khosravan & Bagci, 2018; Zhu et al., 2018) adopted 3D CNNs for automated detection of lung nodules in chest CT images, and the superiority of CNN-based methods was established in the CT-based lung nodule analysis (LUNA16) challenge (Setio et al., 2017). Hamidian et al. (2017) introduced a two-stage approach to locate candidate regions for possible nodules in the chest CT image using a 3D fully convolutional NN (FCNN) and then to detect and classify nodules in candidate regions using a 3D CNN, which significantly improved computational throughput without compromising detection and classification performance. Humphries et al. (2020) applied CNN to classify emphysema patterns in chest CT images from large nationwide COPD studies (Regan et al., 2010; Vestbo et al., 2008) and showed that CNN-based emphysema classification improves the prediction of COPD clinical outcomes. Rajpurkar et al. (2017, 2018) developed the CNN-based CheXNet using the ChestX-ray14 dataset (Wang, Peng, et al., 2017) to classify different pathological abnormalities and diseases from chest x-rays, achieving an accuracy comparable to the performance of practicing radiologists.

FIGURE 2 Major components of a convolutional neural network architecture used for binary image classification.

| U-net
Segmentation of a target anatomic object is frequently used as one of the founding steps in medical imaging applications (Hesamian et al., 2019; Pham et al., 2000; Sharma & Aggarwal, 2010). Image segmentation involves both object detection as well as pixel- or voxel-level likelihood mapping defining the spatial extent of the target object. As discussed earlier, the CNN architecture is better suited to accomplish object detection or overall image classification but may not be an optimal choice for pixel-level object likelihood mapping. Ronneberger and his research team presented landmark papers (Çiçek et al., 2016; Ronneberger et al., 2015) introducing 2D and 3D U-net architectures for pixel- and voxel-level likelihood classification.
Several variations of U-net, namely, V-net (Milletari et al., 2016), E-net (Paszke et al., 2016), and SegNet (Badrinarayanan et al., 2017), were concurrently reported, which share a similar symmetric encoder-decoder architecture. U-net together with its variations (Badrinarayanan et al., 2017; Milletari et al., 2016) has become the most popular DL network architecture for biomedical image segmentation (Hesamian et al., 2019; Litjens et al., 2017). The U-net was named for its "U"-shaped architecture, which consists of a contracting (encoding) path on the left and a symmetric expansive (decoding) path on the right that enable pixel-level classification based on encoded multiscale features (see Figure 3). The contracting path follows the architecture of a traditional CNN, and, at each hierarchical layer, it convolves the image to extract feature patterns and passes those patterns into the next layer at a larger scale through pooling. At the last stage of the contracting path, no pooling is involved, and the encoded feature patterns are directly passed to the entrance of the expansive path, which is designed to upscale feature maps to generate a pixel-level object likelihood classification map. Each layer on the expansive path functions in three sequential steps: (1) up-sampling of feature patterns from the previous layer, (2) concatenation of feature patterns from the layer on the contracting path at the matching scale, and (3) convolution of the concatenated feature map. Both the up-sampling convolution kernel in Step 1 and the feature convolution kernel in Step 3 are learned during training. The concatenation connection between features at matching scales on the contracting and expansive paths allows gradients to flow through the network more freely, facilitating feature learning at individual scales, mitigating the issue of vanishing gradients, and reducing information loss due to restricted gradient flow through the entire encoding and decoding paths (He et al., 2016). At the final layer of the U-net, multiple feature maps are flattened, and features at matching pixel locations are convolved to generate a pixel-level object likelihood map.
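The pooling, up-sampling, and skip-concatenation mechanics described above can be illustrated at the level of tensor shapes for one encoder/decoder level. This NumPy-only sketch omits the learned convolutions that a real U-net applies at every step; sizes are illustrative.

```python
# Shape-level sketch of one U-net encoder/decoder level with a skip connection:
# pool on the contracting path, upsample on the expansive path, then concatenate
# features at the matching scale. NumPy only; real U-nets additionally learn
# convolution kernels at every step shown here.
import numpy as np

def max_pool(x, size=2):
    """x: (channels, H, W) -> (channels, H/size, W/size)."""
    c, h, w = x.shape
    return x.reshape(c, h // size, size, w // size, size).max(axis=(2, 4))

def upsample(x, size=2):
    """Nearest-neighbor up-sampling (stand-in for a learned up-convolution)."""
    return x.repeat(size, axis=1).repeat(size, axis=2)

encoder_feat = np.random.default_rng(0).random((16, 32, 32))  # (channels, H, W)
bottleneck = max_pool(encoder_feat)                 # contracting path: (16, 16, 16)
decoded = upsample(bottleneck)                      # expansive path:   (16, 32, 32)

# Skip connection: concatenate encoder features at the matching scale along
# the channel axis before the next (learned) convolution.
skip = np.concatenate([encoder_feat, decoded], axis=0)
print(skip.shape)  # (32, 32, 32)
```

The doubled channel count after concatenation is why U-net decoder convolutions take twice as many input channels as their encoder counterparts at the same scale.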
The basic architecture of U-net was designed to deliver an efficient data-driven multi-scale feature learning paradigm to solve segmentation challenges, which essentially generates a spatial likelihood mapping of the target object (Milletari et al., 2016; Ronneberger et al., 2015). In the introductory paper on U-net, Ronneberger et al. (2015) applied the method to segment neuronal structures from 2D images of electron microscopic stacks and demonstrated its outperformance as compared to a sliding-window CNN method (Ciresan et al., 2012). A 3D extension of U-net was presented by Çiçek et al. (2016), and the method was applied to a confocal microscopic dataset of Xenopus kidney. Subsequently, U-net and its variants have been widely applied for various applications of medical image segmentation, and the superiority of their performance as compared to previously established state-of-the-art approaches using conventional methods has been well-documented (Chen et al., 2020; Hesamian et al., 2019; Işın et al., 2016). Later, the basic principle of the U-net architecture or its variants has been adopted to achieve data-driven solutions for medical image denoising (Han & Ye, 2018; Liu, Tracey, et al., 2019), registration (Cheng et al., 2019; Salem et al., 2020), super-resolution (Kolarik et al., 2019), and data augmentation (Li et al., 2018; Lucena et al., 2019).
In pulmonary imaging, the U-net and its variants have been primarily used for segmentation of different anatomic structures (Chen et al., 2020; Nadeem et al., 2021; Wu et al., 2021). Gordienko et al. (2018) applied the U-net to accurately segment lung regions from chest x-rays, which was used to restrict the domain for lung nodule detection. Skourt et al. (2018) applied the U-net to segment lung regions on axial slices of chest CT scans and validated the method using images from the Lung Image Database Consortium (LIDC-IDRI; Armato et al., 2011). Gerard et al. (2019, 2020) developed FissureNet, using Seg3DNet, a memory-efficient version of U-net, and demonstrated its application to segmentation of lung regions, lobes, and fissures from chest CT images. Wang, Chen, et al. (2019) presented a two-step pulmonary lobe segmentation method for chest CT scans. First, they applied a U-net to segment lung regions from a CT scan, and then used a modified V-net for lobe segmentation.

| Generative adversarial networks
The notion of a generator-discriminator pair in a deep network architecture, referred to as a generative adversarial network (GAN), was introduced in a landmark paper by Goodfellow et al. (2020). A GAN learns a training image distribution and enables generation of new image samples consistent with that distribution. The generator (G), which learns to generate new image examples consistent with the training data distribution, and the discriminator (D), which learns to discriminate between real and generated images, are the two primary modules of a GAN (see Figure 4). These two modules are trained simultaneously in a zero-sum adversarial game until the generator's performance reaches a level where the discriminator can no longer discriminate between generated fake and real images, that is, images generated by G are consistent with the training data distribution. G has an architecture similar to the expansive path of a U-net, without any skip-connection concatenation, and, functionally, takes a fixed-length random vector as input and generates a new image sample. On the other hand, D is usually designed as an FCNN architecture that takes an image as input and generates a binary class label of real or fake (generated). Both generator and discriminator are trained simultaneously using error backpropagation; for a specific training case, the error loss is passed to the generator if the discriminator successfully labels the generated image as fake; otherwise, the discriminator is penalized. GAN has been successfully applied in several image-generation applications including text-to-image synthesis (Xu et al., 2018), super-resolution (Ledig et al., 2017), and image-to-image translation (Zhu et al., 2017). A few popular variations of the GAN architecture include conditional GAN (Mirza & Osindero, 2014), Cycle-Consistent Adversarial Networks (CycleGAN; Zhu et al., 2017), style-based GAN (Karras et al., 2019), least squares GAN (Mao et al., 2017), and cross-domain discovery GAN (DiscoGAN; Kim et al., 2017). Radford et al. (2015) designed an unsupervised GAN that does not require pairing between input prompts and output images in the training dataset. The unsupervised GAN is effective in several medical imaging applications including resolution enhancement, where it is difficult to generate matching low- and high-resolution images (Isaac & Kulkarni, 2015).
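The zero-sum objective described above can be sketched numerically. The snippet below is a toy illustration, not a trainable GAN: a fixed logistic scorer stands in for the discriminator, and two hypothetical 1-D sample sets stand in for real and generated images; it only evaluates the standard adversarial losses that the two modules would respectively minimize during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy "discriminator": a fixed logistic scorer on 1-D samples.
# In a real GAN, both D and G are deep networks trained jointly.
def discriminator(x, w=2.0, b=-1.0):
    return sigmoid(w * x + b)   # probability that x is real

real = rng.normal(1.0, 0.1, size=64)   # samples from the "true" distribution
fake = rng.normal(0.0, 0.1, size=64)   # samples produced by the generator

# Zero-sum objectives: D maximizes log D(x) + log(1 - D(G(z))),
# while G minimizes log(1 - D(G(z))) (in practice, maximizes log D(G(z))).
d_loss = -np.mean(np.log(discriminator(real)) + np.log(1.0 - discriminator(fake)))
g_loss = -np.mean(np.log(discriminator(fake)))

print(f"discriminator loss: {d_loss:.3f}")
print(f"generator loss:     {g_loss:.3f}")
```

In actual training, gradients of these two losses would alternately update D's and G's weights until neither loss can be improved, i.e., the discriminator can no longer separate real from generated samples.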
In pulmonary imaging, GANs have been used for synthetic data generation and data augmentation (Chartsias et al., 2017; Chuquicusma et al., 2018; Jin et al., 2018; Madani et al., 2018; Salehinejad et al., 2018) to improve the diversity of a dataset without compromising data fidelity. Jin et al. (2018) applied a GAN to add nodules to background CT images for the purpose of data augmentation in the context of DL-based nodule detection. GANs have been widely applied for enhancement of pulmonary CT and MR image resolution (Bing et al., 2019; Chen et al., 2016). Loey et al. (2020) used classical data augmentation and a conditional GAN to augment chest CT datasets for COVID-19 detection and observed significant improvement in COVID-19 classification. Xu et al. (2020) employed a conditional GAN for super-resolution of chest x-ray images. Conditional GANs have also been applied for cross-modality image conversion (Bi et al., 2017; Jiang et al., 2018; Jin et al., 2019). Mahapatra et al. (2018) used CycleGAN to jointly segment and register chest x-rays and achieved improved performance as compared to approaches that use the images separately.

FIGURE 4 Major components and connections in a generative adversarial network (GAN). The generator in a GAN learns to generate plausible data from a random input, while the discriminator learns to distinguish between generated fake data and real data. The generator and discriminator are trained simultaneously until the generator's performance reaches a level where the discriminator can no longer discriminate between generated fake and real images.

| Transformer networks
Vaswani et al. (2017) introduced a fundamentally new network architecture, the Transformer, which replaces the recurrence and convolutions of earlier sequence and CNN-based approaches with the attention mechanism alone (Bahdanau et al., 2014). The Transformer model architecture consists of an encoder-decoder structure (Figure 5a) and uses learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model. The encoder is designed as a stack of multiple layers, where each layer is composed of two sub-layers (Figure 5a). The first sub-layer is a multi-head self-attention mechanism that maps input embeddings to queries and key-value pairs. From the training dataset, the attention mechanism learns to project the input embeddings into a query and a set of key-value pairs, which are then mapped to attention outputs. Attention outputs are computed as weighted sums of the values, where the weights are computed using the compatibility, or scaled dot-products, of the query and the key. Multi-head attention allows the model to jointly learn different subspace representations of the input embeddings in the training data to capture multiple dependencies and interactions among query and key-value pairs. The second sub-layer in an encoder layer is a fully connected, position-wise feed-forward NN, which is applied identically at each token position and adds nonlinearity to the output. Residual connections are employed around each of the two sub-layers, followed by layer normalization. The decoder consists of a stack of multiple layers similar to the encoder, but each layer has three sub-layers (Figure 5a). The first sub-layer, masked multi-head attention, performs a similar function as the first sub-layer of the encoder, except that it is masked to prevent attention to upcoming tokens in the output sequence; this ensures that training does not depend on future tokens that are unavailable during inference. The second sub-layer computes multi-head attention between the encoder and decoder. Specifically, keys and values come from the output of the last layer of the encoder, and queries come from the current layer of the decoder. This mechanism allows every position in the decoder to attend over all positions in the input sequence. The third sub-layer is a fully connected, position-wise feed-forward NN adding nonlinearity to the output. Finally, the output of the last decoder layer is input to a linear layer and a softmax classifier to predict the output token. Note that positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks to inject information about the relative or absolute position of the tokens in the sequence. Dosovitskiy et al. (2020) adopted the Transformer network for image classification and established the Vision Transformer (ViT) framework. Instead of using individual pixels, Dosovitskiy et al. proposed to split an image into smaller patches in raster order and use these patches as tokens to incorporate local contextual information. Subsequently, these tokens are flattened and undergo the same input embedding procedure as in the conventional Transformer. Unlike the conventional Transformer, ViT uses only an encoder path (see Figure 5b). Additionally, a learnable classification token, represented as a d_model-dimensional vector, is prepended to the sequence of input embeddings and serves as the image representation, or class, at the final Transformer encoder layer; essentially, ViT learns to generate the output classification token through training. Since the model computes attention between every pair of image patches, it learns both local and global contextual relationships in terms of position-wise correlations between tokens. This is in contrast to CNNs, which use convolution kernels to represent local features and aggregate local features into deeper layers, increasing the receptive fields or scales of deeper features (Raghu et al., 2021). Therefore, Transformers are well suited for imaging applications where global contextual information is the basis for the target outcome, for example, image classification.
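The attention computation at the heart of these sub-layers can be written compactly. The following sketch, in plain NumPy, implements scaled dot-product attention for a hypothetical configuration of 4 heads over 10 tokens (d_model = 64); real implementations also include the learned projection matrices for queries, keys, values, and the concatenated output, which are omitted here.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, per head."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # query-key compatibility
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over key positions
    return weights @ V                                 # weighted sum of the values

rng = np.random.default_rng(0)
n_heads, n_tokens, d_head = 4, 10, 16                  # d_model = n_heads * d_head = 64
Q = rng.normal(size=(n_heads, n_tokens, d_head))
K = rng.normal(size=(n_heads, n_tokens, d_head))
V = rng.normal(size=(n_heads, n_tokens, d_head))

# Each head attends in its own subspace; head outputs are concatenated to d_model.
out = np.concatenate(scaled_dot_product_attention(Q, K, V), axis=-1)
print(out.shape)  # (10, 64)
```

Each of the 10 output rows is a d_model-dimensional vector summarizing, for one token position, a weighted mixture of all value vectors, with weights learned from query-key compatibility.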
It may be noted that Transformers require a large amount of training data to build a representative set of tokens and a reliable assessment of attention. Transformers have performed very well in tasks where very large datasets with small images are available (Sun et al., 2017). However, in medical imaging applications, large repositories of curated datasets are rare, which may add to the challenges in adopting a Transformer architecture. An imaging task requiring pixel-level classification, for example, segmentation, requires a single pass of the entire network for each pixel, making it computationally infeasible. To overcome this challenge, the shifted-window Transformer, or Swin-Transformer (Liu et al., 2021), was developed, which uses smaller input patches (4 × 4 vs. the 16 × 16 used in ViT), an attention mechanism limited to nearby patches, and concatenation of output vectors after each encoding layer to increase the receptive field, similar to the CNN design. Cao et al. presented Swin-Unet (Cao et al., 2023), a U-type multi-scale architecture built from pure Transformer modules based on the Swin-Transformer (Liu et al., 2021). Recently, hybrid architectures combining the principles of Transformer and CNN designs have drawn significant research attention, and such network architectures are being widely applied in medical imaging (Parvaiz et al., 2023; Shamshad et al., 2023). In pulmonary imaging, applications of Transformers are mostly limited to lung cancer nodule recognition and classification (Yang, Myronenko, et al., 2021; Zheng, Gindra, et al., 2021; Zhou, Liu, et al., 2022) and COVID-19 detection (Gao et al., 2021; Hsu et al., 2021).
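The patch tokenization step that distinguishes ViT and Swin-Transformer inputs is simple to sketch. Below is a minimal NumPy illustration of splitting a hypothetical 224 × 224 image into raster-order patch tokens, contrasting the 16 × 16 patches of ViT with the 4 × 4 patches of Swin-Transformer; the subsequent linear embedding and position encoding are omitted.

```python
import numpy as np

def image_to_tokens(img, patch):
    """Split an (H, W) image into non-overlapping patch tokens in raster order."""
    H, W = img.shape
    assert H % patch == 0 and W % patch == 0
    tokens = (img.reshape(H // patch, patch, W // patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, patch * patch))   # each row is one flattened patch
    return tokens

img = np.arange(224 * 224, dtype=float).reshape(224, 224)

vit_tokens = image_to_tokens(img, 16)    # ViT-style 16x16 patches
swin_tokens = image_to_tokens(img, 4)    # Swin-style 4x4 patches

print(vit_tokens.shape)    # (196, 256): 14*14 tokens, 256 pixels each
print(swin_tokens.shape)   # (3136, 16): 56*56 smaller tokens
```

The 16-fold increase in token count explains why Swin-Transformer restricts attention to nearby patches: full pairwise attention over 3136 tokens would be far more expensive than over 196.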

| Explainable AI
Explainable artificial intelligence (XAI) refers collectively to methods that, instead of only using the decision outcome, attempt to analyze the decision-making process of an AI model to facilitate its understanding from an end-user perspective (Arrieta et al., 2020). See Adadi and Berrada (2018), Arrieta et al. (2020), Došilović et al. (2018), Murdoch et al. (2019), and Tjoa and Guan (2020) for reviews of XAI approaches. XAI approaches may be categorized from three different perspectives: (1) within-model versus post hoc XAI; (2) model-specific versus model-agnostic XAI; and (3) global versus local (data-specific) scope of XAI. Within-model XAI requires a model's internal workings to be explainable, for example, linear regression or SVM, which are models simple enough to be understood but sophisticated enough to model an input-output relationship satisfactorily (Murdoch et al., 2019; Tibshirani, 1996). On the other hand, post hoc explanation involves analyzing a trained AI model to gain insight into learned relationships by considering the model as a black box (Adadi & Berrada, 2018). A model-specific explanation method is limited to a specific type of AI model and uses prior knowledge of the model architecture and internal attributes (Adadi & Berrada, 2018). Model-agnostic explanation is independent of the AI model, operating solely on the input and the output of the model, and is, by definition, a post hoc approach. Global or dataset-level explanation provides general relationships learned by an AI model, while local explanation provides an explanation for each input data sample.
In pulmonary imaging, XAI has been adopted in the form of class activation mapping (CAM; Zhou et al., 2016) and gradient-weighted CAM (Grad-CAM; Selvaraju et al., 2017) in lung cancer prediction (Ausawalaithong et al., 2018; Hosny et al., 2018), COVID-19 classification (Brunese et al., 2020; Ko et al., 2020), COPD classification (Humphries et al., 2020), and multi-disease classification (Chen et al., 2019; Dunnmon et al., 2019; Sedai et al., 2018). CAM produces a heat map at the final convolutional layer of a CNN, which shows the regions of an input image that contribute to the output classification (Zhou et al., 2016). Multi-scale CAMs have been developed (Liao et al., 2019; Ma, Ji, et al., 2020) to explain the roles of the multi-scale features often used in decision-making for medical imaging applications. The primary difference between Grad-CAM and CAM is that Grad-CAM computes the weight for each feature map from the global average pooling of its gradients and applies a ReLU to emphasize only positive values.
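The Grad-CAM computation just described reduces to a few array operations. The sketch below assumes hypothetical activations and gradients from a final convolutional layer; in practice these come from a forward and backward pass through a trained CNN.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heat map from the last conv layer's activations and gradients.

    feature_maps: (K, H, W) activations A^k; gradients: (K, H, W) d(score)/dA^k.
    """
    weights = gradients.mean(axis=(1, 2))              # global average pooling of gradients
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum over the K maps
    return np.maximum(cam, 0.0)                        # ReLU keeps positive evidence only

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 7, 7))    # 8 feature maps from a hypothetical 7x7 final conv layer
dA = rng.normal(size=(8, 7, 7))   # gradients of a class score w.r.t. each map

heatmap = grad_cam(A, dA)
print(heatmap.shape)              # (7, 7)
```

The resulting low-resolution map is typically upsampled to the input image size and overlaid on the scan to show which regions supported the predicted class.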

| Lungs and lung lobes

Due to computational constraints, early DL-based lung segmentation methods for chest CT imaging were restricted to slice-by-slice approaches (Comelli et al., 2020; Skourt et al., 2018; Suri et al., 2021). Skourt et al. (2018) used a 2D U-net to segment lung parenchyma in axial chest CT slices from the LIDC-IDRI dataset and achieved a 0.95 Dice score. Comelli et al. (2020) observed that both U-net- and E-net-based slice-by-slice lung segmentation from chest CT images outperform state-of-the-art conventional image processing methods; however, they observed no significant performance difference between the two DL methods. Later, DL-based lung segmentation methods adopted volumetric approaches and used rectangular cuboid subregion inputs for training and testing to incorporate 3D contextual information. Park et al. (2020) presented a CT-based method that built upon presegmentation of the right and left lungs and used a U-net to classify individual lung lobes. Wang, Chen, et al. (2019) used a 2D U-net for CT-based slice-by-slice lung segmentation and a modified 3D V-net for lung lobe classification on images from the LUNA16 challenge dataset (Setio et al., 2017). Gerard and colleagues developed state-of-the-art DL-based pipelines for CT-based lung (Gerard et al., 2021), fissure (Gerard et al., 2019), lobe (Gerard & Reinhardt, 2019), and fissure integrity segmentation (Althof et al., 2020; see Figure 6a). Each module of the LungNet-FissureNet-LobeNet pipeline consists of Seg3DNet, a variation of a multi-resolution cascaded CNN, to learn global and local features that are important for segmentation of pulmonary structures (see Figure 6b). The pipeline has been extensively evaluated on various imaging protocols in large multi-center studies covering multiple lung diseases such as COPD, COVID-19, pulmonary fibrosis, and lung cancer, and it has been a leading performer in the Lobe and Lung Analysis 2011 (LOLA11) grand challenge for lobar segmentation.
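The Dice score used to report these segmentation results is a standard overlap metric between a predicted and a reference mask. A minimal sketch on toy 2-D masks:

```python
import numpy as np

def dice_score(pred, truth):
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0

# Toy masks: a predicted region shifted by one column against the reference.
truth = np.zeros((10, 10), dtype=bool)
truth[2:8, 2:8] = True            # 36 reference pixels
pred = np.zeros((10, 10), dtype=bool)
pred[2:8, 3:9] = True             # 36 predicted pixels, overlap 6x5 = 30

print(round(dice_score(pred, truth), 3))  # 0.833 = 2*30 / (36 + 36)
```

A Dice score of 1.0 indicates a perfect match, while scores around 0.91-0.95, as cited above for lung segmentation, indicate close but imperfect boundary agreement.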
Lung segmentation in chest x-ray images is used to facilitate CAD of lung cancer nodules and lesions (Hu et al., 2018), tuberculosis (Lakhani & Sundaram, 2017), COPD (González, Ash, et al., 2018), pneumonia (Rajpurkar et al., 2017), and COVID-19 (Ozturk et al., 2020; Panwar et al., 2020). Early DL-based approaches for lung segmentation from chest x-ray images used CNNs and FCNNs to isolate lung areas (Hooda et al., 2018; Huynh & Anh, 2019; Kitahara et al., 2019). Souza et al. (2019) developed a two-stage DL-based lung segmentation method for chest x-rays, where they applied a standard CNN for initial patch-based lung region classification followed by a residual CNN to achieve the final refined segmentation. Hooda et al. (2018) presented a lung segmentation method for chest x-rays using an FCNN, which fused output feature maps at the end of each hierarchical level to predict the segmented lung region. More recent chest x-ray-based lung segmentation methods adopted U-net and other encoder-decoder architectures (Furutani et al., 2019; Kalinovsky & Kovalev, 2016; Mittal et al., 2018). Furutani et al. (2019) used a U-net to segment lung areas from chest x-rays and achieved a Dice score of 0.91. Kalinovsky and Kovalev (2016) applied a standard SegNet to segment lung areas in x-rays, while Mittal et al. (2018) used a modified SegNet with skip connections added at the end of each hierarchical layer before pooling, which achieved higher accuracy for x-ray-based lung segmentation.

| Pulmonary vessels and airways
Segmentation and quantitative characterization of the pulmonary vasculature are widely used for detection of pulmonary emboli (Loud et al., 2001; Qanadli et al., 2000) and hypertension (Devaraj et al., 2010; Shen et al., 2014). Early CT-based pulmonary vessel segmentation algorithms (Kaftan et al., 2008; Shikata et al., 2004; Van Rikxoort & Van Ginneken, 2013) used traditional image processing methods of region growing and vesselness (Frangi et al., 1998). Expectedly, recent algorithms have adopted DL methods for pulmonary vessel segmentation from chest CT imaging (Cui et al., 2019; Nam et al., 2021; Tan et al., 2021). Cui et al. (2019) used three parallel U-nets and stacks of axial, sagittal, and coronal chest CT slices to generate vesselness maps in three orthogonal planes and fused those maps to generate a volumetric vesselness representation. Nam et al. (2021) used a single U-net and a stack of axial chest CT slices to compute a vessel-likelihood map from simulated noncontrast CT images. Tan et al. (2021) examined the performance of 12 different DL-based pulmonary vessel segmentation algorithms using both regular CT and CT angiography images and observed that a U-net-like encoder-decoder DL architecture, combined with advanced post-processing methods incorporating the spatial distribution of the DL-based likelihood map, improves the accuracy of the final segmentation. After vessel segmentation, separation of pulmonary arteries and veins is useful to improve detection of pulmonary emboli, assessment of pulmonary arterial hypertension, and characterization of arterial alterations in COPD (Charbonnier et al., 2015; Estépar et al., 2013; Wittenberg et al., 2012). Saha et al. (2010) used multi-scale opening on presegmented pulmonary vascular trees to separate pulmonary arteries and veins in noncontrast chest CT images. Nardelli et al. (2018) studied multiple CNN-based architectures and introduced advanced postprocessing using graph-cut methods to separate pulmonary arteries and veins from presegmented pulmonary vascular trees in noncontrast CT images.
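The orthogonal-plane fusion strategy of Cui et al. (2019) can be sketched schematically. In the toy example below, random arrays stand in for the three stacks of per-slice U-net vesselness outputs, and a simple voxel-wise mean is used as one plausible fusion rule; the fusion rule and the 0.5 decision threshold are illustrative assumptions, not the published method's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)
vol_shape = (32, 32, 32)          # toy chest CT volume (z, y, x)

# Stand-ins for per-plane 2-D network outputs: one vesselness likelihood per
# voxel, predicted slice-by-slice along each of the three orthogonal axes.
axial    = rng.random(vol_shape)  # predicted along z
coronal  = rng.random(vol_shape)  # predicted along y
sagittal = rng.random(vol_shape)  # predicted along x

# Fuse the three orthogonal likelihood maps into one volumetric vesselness
# representation via a voxel-wise mean, then threshold for a binary mask.
fused = (axial + coronal + sagittal) / 3.0
vessel_mask = fused > 0.5         # hypothetical decision threshold

print(fused.shape)                # (32, 32, 32)
```

Averaging across planes suppresses spurious responses that appear in only one orientation, since a true vessel should be detected consistently in all three slice directions.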
The 3D U-net has been popularly adopted for airway segmentation, where a DL network is used to generate voxel-level airway likelihood maps from chest CT images. Jin et al. (2017) applied fuzzy distance and connectivity analysis (Saha et al., 2002; Udupa & Saha, 2003) and scale-based leakage detection (Nadeem et al., 2017, 2021) to delineate the airway tree volume from a DL-generated airway likelihood map. Wang, Hayashi, et al. (2019) added a 3D recurrent convolutional layer at the highest resolution of a U-net and introduced a radial distance loss function that better conforms to the tubularity features of airway structures. Juarez et al. (2018) processed the left and right lungs separately and adopted a 3D U-net and a sliding window approach to segment airways. Qin et al. (2019, 2020) developed AirwayNet, where they revised the expansive path of a U-net by adding the lung distance map at the last hierarchical level to emphasize airway connectivity features. Yun et al. (2019) developed an airway segmentation algorithm that starts by locating airway candidate regions and, subsequently, for each candidate region voxel, uses CNN-based classification on three orthogonal planes through the voxel to confirm whether it belongs to the airways. Recently, Nadeem, Hoffman, and Saha (2018), Nadeem et al. (2021), and Nadeem, Hoffman, Sieren, and Saha (2018) presented a novel multi-parametric segmentation approach referred to as freeze-and-grow, which starts with a conservative segmentation parameter and iteratively relaxes the parameter to capture finer details, while rectifying segmentation leakages and freezing growth at leakage locations after each iteration. Experimental results (Nadeem et al., 2021) have demonstrated that the conventional freeze-and-grow algorithm outperforms DL-based results (Charbonnier et al., 2017; Jin et al., 2017). It was also observed that coupling the conventional freeze-and-grow method with DL significantly improves computational efficiency and allows airway detection in low-radiation CT scans (Nadeem et al., 2021; Nadeem, Comellas, Guha, et al., 2022). Different steps of a DL-powered freeze-and-grow algorithm are illustrated in Figure 7. The method was found to be robust for CT images of patients with bronchiectasis, where airway lumen morphology is significantly altered (see Figure 8).
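The freeze-and-grow idea can be illustrated on a toy 2-D likelihood map. The sketch below is a drastic simplification of the published algorithm: growth is plain 4-connected flooding over a decreasing threshold sequence, leakage is flagged when one relaxation step grows the region explosively, and the whole offending step is undone and frozen, whereas the actual method localizes leakages and freezes only those sites.

```python
import numpy as np
from collections import deque

def grow(likelihood, region, frozen, thresh):
    """Connected growth: add 4-neighbors above thresh, avoiding frozen pixels."""
    added = set()
    q = deque(zip(*np.nonzero(region)))
    H, W = likelihood.shape
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < H and 0 <= nc < W and not region[nr, nc]
                    and not frozen[nr, nc] and likelihood[nr, nc] >= thresh):
                region[nr, nc] = True
                added.add((nr, nc))
                q.append((nr, nc))
    return added

def freeze_and_grow(likelihood, seed, thresholds, leak_factor=10.0):
    """Iteratively relax the threshold; undo and freeze an explosive growth step."""
    region = np.zeros(likelihood.shape, dtype=bool)
    frozen = np.zeros(likelihood.shape, dtype=bool)
    region[seed] = True
    for t in thresholds:                         # conservative -> relaxed parameters
        before = region.sum()
        added = grow(likelihood, region, frozen, t)
        if region.sum() > leak_factor * before:  # sudden volume jump => leakage
            for rc in added:                     # rectify the leak: undo and freeze
                region[rc] = False
                frozen[rc] = True
    return region

# Toy map: a bright airway-like branch beside a large moderate "parenchyma"
# region that a relaxed threshold would leak into via a thin connection.
lik = np.zeros((15, 15))
lik[7, 0:5] = 0.9        # bright airway branch
lik[7, 5] = 0.5          # thin connection where the leak would start
lik[:, 6:] = 0.5         # large moderate region behind the connection

seg = freeze_and_grow(lik, seed=(7, 0), thresholds=[0.8, 0.4])
print(int(seg.sum()))    # 5: the airway branch is kept, the leak is frozen out
```

The conservative first threshold captures only the bright branch; the relaxed second threshold floods into the large moderate region, which the volume-jump test detects and rectifies, leaving the branch segmentation intact.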

| Lesions and nodules
Lung lesions and nodules are examined for diagnosis of lung cancer and monitoring of disease progression or treatment effects (Gould et al., 2013). Image-based quantitative assessment of lung lesions and nodules is widely used in CAD of lung cancer (Kalpathy-Cramer et al., 2016). Traditionally, chest x-ray images are used for initial screening of lung nodules (Fontana et al., 1986). More recently, chest CT imaging has become the standard for lung cancer screening (Hoffman & Sanchez, 2017). Early methods of x-ray- or CT-based lung nodule detection and classification (Dehmeshki et al., 2008; Mukhopadhyay, 2016; Nithila & Kumar, 2016; Paragios & Deriche, 1999; Santos et al., 2014) used manually selected features and conventional classification approaches. However, the diversity in appearance, location, and size of lung nodules adds persistent challenges to manual selection of a comprehensive feature set, and here the data-driven multi-scale feature optimization framework of a DL method plays a crucial role and offers major improvements in the performance of image-based lung nodule detection and classification (van Ginneken et al., 2010).

FIGURE 7 A block diagram of major steps in CT-based automated airway segmentation using deep learning and multi-parametric freeze-and-grow algorithms (Nadeem et al., 2021).
The general approach of DL-based nodule segmentation involves training a DL network on a diverse set of nodule images, often augmented using traditional or GAN-based approaches. Wang, Zhou, et al. (2017) presented a CT-based nodule segmentation method that used multi-scale 2D patches as well as a stack of axial CT image patches on two parallel FCNNs, and the method matched radiologists' performance on the LIDC-IDRI dataset. Wang, Lu, et al. (2018) used a 3D region-proposal CNN to simultaneously locate, segment, and classify lung nodules in chest CT images. Wu et al. (2018) presented a 3D U-net-based method to segment lung nodules and used the intermediate feature maps at different hierarchical levels to train an NN-based classifier for malignancy scoring. Cao et al. (2020) used a set of three contiguous axial CT slices and another set of three transversal slices, both centered at the candidate voxel, as inputs to a dual-branched CNN network and applied central pooling for lung nodule segmentation, which achieved a Dice score higher than human experts.
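The multi-scale patch inputs used by methods such as Wang, Zhou, et al. (2017) can be sketched with simple array slicing. The patch sizes and image below are illustrative assumptions; a real pipeline would also handle candidates near image borders and feed each scale to its own network branch.

```python
import numpy as np

def multiscale_patches(img, center, sizes=(16, 32, 64)):
    """Extract concentric square patches at several scales around a candidate pixel."""
    r, c = center
    patches = []
    for s in sizes:
        h = s // 2
        patches.append(img[r - h:r + h, c - h:c + h])  # assumes center is away from borders
    return patches

img = np.random.default_rng(0).random((128, 128))      # toy CT slice
patches = multiscale_patches(img, center=(64, 64))
print([p.shape for p in patches])   # [(16, 16), (32, 32), (64, 64)]
```

Feeding the same candidate location at several scales lets the network see both the nodule itself and its surrounding context, mirroring the multi-scale feature optimization that the text credits for DL's advantage over hand-crafted features.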

| Thoracic musculoskeletal anatomies
Several lung diseases are linked to different comorbidities (Barnes & Celli, 2009; Boulet & Boulay, 2011; Tammemagi et al., 2003). For example, COPD is associated with significant extra-pulmonary systemic health complications (Agusti et al., 2003; Barnes & Celli, 2009; Holguin et al., 2005). The prevalence of skeletal muscle dysfunction (Marquis et al., 2002; Shrikrishna et al., 2012; Swallow et al., 2007) and osteoporosis (Biskobing, 2002) in COPD patients is well known, putting COPD patients at high risk of accelerated musculoskeletal aging. Pulmonary imaging provides an opportunity to segment and collect data on different thoracic musculoskeletal anatomies such as the ribs, thoracic spine, pectoral muscles, and diaphragm, to study COPD comorbidities such as osteoporosis and muscle dysfunction, and to assess their impacts on disease progression and clinical outcomes.
Segmentation of the spinal column and labeling of individual vertebrae are used to explore spinal degeneration, vertebral fractures, and bone density in different lung diseases including COPD (Jaramillo et al., 2015). In the past, traditional image processing approaches were tailored to segment different thoracic musculoskeletal anatomies using their density, shape, and morphological features (Klinder et al., 2008; Lenchik et al., 2019; Rasoulian et al., 2013; Zhou, Wen, et al., 2022); early methods used spine atlases or statistical shape models to segment the spine and individual vertebrae in CT scans (Kadoury et al., 2013; Klinder et al., 2009). Recently, DL has been popularly adopted for segmentation of thoracic musculoskeletal anatomies. Sekuboyina et al. (2017) adopted a two-stage DL approach to segment the spine and individual vertebrae from CT scans. First, they used an R-CNN DL network to localize the spinal column on sagittal image slices and, subsequently, applied a U-net to segment the spine. Buerger et al. (2020) applied multi-staged U-nets to first achieve coarse binary classification and then, subsequently, accomplish multi-class classification of individual vertebrae at a finer scale. Recently, Nadeem, Comellas, Guha, et al. (2022) adopted a new approach of using DL as a low-level tool to obtain a voxel-level vertebral likelihood map and then applied multi-parametric iterative connectivity and centerline analysis to segment and identify individual vertebrae (see Figure 9). Lu et al. (2018) applied a U-net to MRI to segment the spinal column and a multi-class CNN to assess stenosis grading.
Rib features are studied to assess chest wall morphology and dynamics in COPD and other lung diseases (Sverzellati et al., 2013). CT imaging is commonly used to derive rib features, and traditional approaches to rib segmentation involve CT intensity-based thresholding and connectivity analysis (Lee & Reeves, 2010; Staal et al., 2007) and shape and morphological modeling (Klinder et al., 2007), which were tailored to utilize the spinal proximity or the elongated tubular features of ribs. Jin et al. (2020) developed the U-net-based FracNet and an automated approach to segment individual ribs and locate fractures in chest CT images. Yang, Gu, et al. (2021) developed a semi-automated method that uses thresholding and manual editing to generate a point-cloud representation of rib bone voxels, which is fed to a PointNet++ DL network (Qi et al., 2017) to obtain volumetric representations of individual ribs. Wu et al. (2021) presented an algorithm combining a 3D U-net and a 2D R-CNN to segment and label individual ribs and then locate rib fractures in chest CT scans. Recently, Nadeem et al. (2023) developed a CT-based automated rib segmentation method, which uses a 3D U-net to generate a voxel-level rib probability map and then applies a multi-parametric iterative connectivity and thresholding approach to delineate and label individual ribs.
Pectoralis muscle features derived from chest CT scans are often used as indicators of overall body muscle content and composition, and reduced pectoralis muscle area (PMA) is found to be associated with COPD-related traits (McDonald et al., 2014), such as spirometry, dyspnea, and the BODE (BMI, obstruction, dyspnea, and exercise) index (Celli, Cote, et al., 2004). González, Washko, and San José Estépar (2018) used a modified U-net to segment the left and right major and minor pectoralis and subcutaneous fat regions and observed that using six separate U-nets for the six different regions, instead of a single U-net, improves segmentation performance. Dutta et al. (2022) presented a CT-based pectoral muscle segmentation method, where they used a U-net to generate a pixel-level pectoral muscle likelihood map and then applied freeze-and-grow postprocessing to delineate the pectoral muscle area. Liu, Pan, et al. (2020) and Liu et al. (2022) applied DL-based CNNs for segmentation of muscle and assessment of body composition using body-torso-wide CT imaging.
The diaphragm is a major respiratory muscle, which is situated below the lung base, or diaphragmatic surface, and separates the abdomen from the chest. Based on manual tracking methods, it has been observed that the radius of maximum curvature increases with BMI (Boriek et al., 2017) and that diaphragm motion is altered in patients with either emphysema or idiopathic pulmonary fibrosis (Kang et al., 2021). Several traditional methods using deformable models have been reported in the literature (Beichel et al., 2002; Yalamanchili et al., 2010; Zhou et al., 2007) to segment the diaphragm in chest CT scans. Dynamic MRI has been applied to evaluate diaphragm motion (Hao et al., 2022; Tong et al., 2019). However, segmentation of the diaphragm in chest CT imaging has remained difficult due to the thin appearance of the diaphragm and its lack of contrast with surrounding anatomic structures, and a fully automated robust algorithm for diaphragm segmentation, suitable for application to large studies, is yet to be established.

| CONCLUSION
DL has become a universal tool that has been widely applied in pulmonary medical imaging research and studies. Different DL architectures have demonstrated unique strengths in extracting and leveraging data-driven features to address specific image processing and recognition challenges. For example, CNN-based architectures have been popularly applied to solve classification and recognition tasks, while U-net-like architectures have demonstrated their superiority in solving segmentation and registration challenges requiring spatial likelihood distributions or deformation vector mapping. GAN-based architectures have been commonly adopted for resolution enhancement, registration, and data augmentation, utilizing the unique image-generation capability of a GAN network. Recently, a new AI approach, the Transformer model, has emerged that is based solely on attention mechanisms, dispensing with recurrence and convolutions. However, being a purely data-driven approach, DL faces hurdles related to the generalizability and scalability of a pretrained DL method to unseen data acquired using different protocols, scanners, or modalities. This hurdle poses a tall task for the DL research community, and the importance of a reliable solution is heightened by constantly emerging imaging technologies and the large volumes of diverse image data collected daily for research and clinical purposes. Different approaches have been adopted to mitigate this challenge; the most common is to properly curate the training dataset by including data from diverse sources. Also, refreshing and retraining DL networks using new data has been adopted to simulate learning augmentation. Recently, in the context of image segmentation, a research trend has emerged of using DL as a data-driven multi-scale feature extractor that generates a spatial object-likelihood map, which is then passed to an advanced image-processing cascade tailored to delineate the target object. The lack of transparency and understanding of learning and decision-making mechanisms in AI tools has added a fundamental hurdle to adopting AI in clinical settings, which has driven the emergence of a new research endeavor called explainable AI. In summary, the modern arena of AI and DL has brought previously unimagined potential and drawn tremendous enthusiasm and momentum in machine learning and intelligent problem solving, and currently, the research community is passing through a euphoric phase toward understanding its real merits, strengths, and intelligence mechanisms and differentiating those from mere excitement.

FIGURE 5 Model architectures of the Transformer and Vision Transformer (ViT) networks. (a) The Transformer takes an input sequence, splits it into a sequence of tokens, transduces it into vectors, and then decodes it back into another sequence using solely the attention mechanism. (b) ViT splits an image into square patches, flattens and linearly embeds them, prepends an extra learnable class embedding, adds a position encoding to the embeddings, and feeds the resulting sequence of vectors to a conventional Transformer encoder.

FIGURE 6 CT-based segmentation of lung structures using deep learning (DL). (a) A DL-based pipeline for CT-based lung, fissure, lobe, and fissure integrity segmentation (Althof et al., 2020; Gerard et al., 2019, 2021; Gerard & Reinhardt, 2019). The fissure integrity segmentation produced by IntegrityNet identifies regions of radiographically missing or incomplete fissure (red). Solid black arrows indicate the CT image used as input for each module, dotted black arrows indicate the module output, and dashed black arrows indicate the output of all previous modules used as input to each module. (b) Each module of the segmentation pipeline presented in (a) consists of a variation of a multi-resolution cascaded convolutional neural network to learn global (upper) and local (lower) features that are important for segmentation of pulmonary structures. This figure was provided by Drs. S. E. Gerard and J. M. Reinhardt.

FIGURE 8 Examples of airway segmentations on chest CT scans of a participant with preserved lung function (a) and a participant with bronchiectasis (b). For both cases, airway trees were segmented from inspiratory chest CT scans.

FIGURE 9 Intermediate results of CT-based automated vertebral segmentation, labeling, and fracture assessment on a sagittal image slice using deep learning (DL) and centerline-based morphologic analysis of anterior, middle, and posterior vertebral heights (Nadeem, Comellas, Guha, et al., 2022). (a) CT representation of the spinal column on a truncated sagittal image slice. (b) Spatial likelihood map of vertebral bodies using U-net-based DL. (c) Segmented and labeled vertebrae T1 to L1. (d) Computerized detection of fractured vertebrae (red).