Classifying functional nuclear images with convolutional neural networks: a survey

Abstract: Functional imaging has been successfully applied in recent years to capture functional changes in the pathological tissues of the body. Nuclear medicine functional imaging acquires information about areas of concern (e.g. lesions and organs) in a non-invasive manner, enabling semi-automated or automated decision-making for disease diagnosis, treatment, evaluation, and prediction. Focusing on functional nuclear medicine images, in this study, the authors review existing work on classifying single-photon emission computed tomography and positron emission tomography images, as well as their hybrid modalities with computed tomography and magnetic resonance imaging, by using convolutional neural network (CNN) techniques. Specifically, they first present an overview of nuclear imaging and the CNN technique, covering nuclear imaging modalities, the nuclear image data format, CNN architecture, and the main CNN classification models. According to the diseases of concern, they then group the existing CNN-based work on classifying functional nuclear images into three categories. For typical work in each category, they present details about the research objectives, the adopted CNN models, and the main results achieved. Finally, they discuss research challenges and directions for developing CNN-based technological solutions to classify nuclear medicine images.


Introduction
The past few decades have witnessed a significant rise in demand for medical image analysis. This is because medical imaging provides an important basis for displaying and differentiating pathological tissue from the normal tissue of the body. In particular, structural medical imaging has been broadly applied to acquire the anatomic or morphological structures of organs and tissues of the human body for disease diagnosis. The main structural imaging modalities are ultrasound, magnetic resonance imaging (MRI), and computed tomography (CT) [1]. Physiological or functional medical imaging, as an alternative, has been successfully applied to capture functional changes in pathological tissues in clinical examination over the past few decades [1].
For functional imaging, there is a four-order division [2]: the first order is used for imaging of organ motion, such as blood, lungs, and heart; the second order is responsible for imaging of excretory functions, such as kidneys and liver; and the third and fourth orders are in charge of imaging of metabolism (except excretory functions). Particularly, the imaging of metabolism is the privilege of nuclear medicine alone [2]. Common functional imaging techniques are positron emission tomography (PET), diffusion-weighted imaging (DWI), dynamic contrast-enhanced MRI, perfusion CT, magnetic resonance (MR) lymphography, single-photon emission computed tomography (SPECT), MR spectroscopy, and blood oxygenation level-dependent MRI [3].
In nuclear medicine, the main modalities of functional imaging are PET, SPECT, as well as their hybrid modalities of SPECT-MRI, PET-CT, PET-MRI, and SPECT-CT [4]. As a typical noninvasive medical imaging modality, nuclear medicine functional imaging (nuclear imaging for short) refers to injecting radionuclides (isotopes) into a patient's body and capturing information about areas of concern, e.g. lesions and organs [4].
Imaging equipment picks up the gamma rays emitted by the isotopes to produce a map of the inside of the body and to identify the areas of concern. With increasingly advanced technology, nuclear imaging has become much safer as a result of reduced exposure to radioactive isotopes. The isotopes used in nuclear imaging studies also have short half-lives; injected into the body in small quantities, they ensure that the radioactivity is quickly eliminated after an examination is completed.
With medical images, a decision could be made for disease diagnosis, treatment, evaluation, and prediction. As an indispensable supplement to clinicians' decision-making, computer-aided diagnosis/detection (CAD) has already been utilised extensively within radiology, playing a key role in the decision-making process [5]. Specifically, the CAD systems use computer-generated outputs as an assisting tool for a clinician to make a diagnosis. In particular, machine learning algorithms for diagnosis accept inputs of characterising different features to produce a probability distribution of two classes (e.g. disease present or absent) that helps clinicians. Furthermore, automated CAD is capable of generating an end diagnosis by using computer algorithms alone.
As a typical technique of deep learning, a convolutional neural network (CNN) was specifically designed for images, with its superior capability of automatically extracting from the low-level to high-level features from images for classification [6]. There is a great number of successful applications of CNNs in medical images, such as disease classification, object detection, and object segmentation [7][8][9]. For example, CNN-based models have been built for diagnoses of cancer [10], classification of liver, breast, and blood neoplasias [11], detection of pulmonary nodule [12], brain tumour analysis [13], as well as segmentation of kidney contours [14], tumour [15], and brainstem [16].
IET Image Process., 2020, Vol. 14, Iss. 14, pp. 3300-3313. This is an open access article published by the IET under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/).
Classifying a medical image aims to label the image according to the modality that produced it or to label an illustration according to its attributes [17]. CAD systems and other medical imaging applications accurately segment two-dimensional (2D), 3D, and 4D medical images into isolated anatomical objects of interest [18]. Object detection, moreover, is logically completed before a segmentation task.
Deep learning for medical images has been the topic of several review articles. For example, Litjens et al. [7] surveyed deep learning for medical images, focusing on classification, detection, and segmentation. Shen et al. [19] focused on reviewing deep learning methods for anatomical and cell structure detection, medical image registration, tissue segmentation, and computer-aided disease diagnosis or prognosis. Ker et al. [20] summarised key results on localisation, segmentation, registration, detection, and medical image classification. Analysing the principles of deep learning, highlighting the popular CNNs, and presenting a framework for image classification and segmentation, Tian et al. [21] reviewed deep learning methods for medical image analysis. From a different perspective, Ravi et al. [22] reviewed the applications of deep learning in health informatics.
All the existing reviews [7, 19-22] on deep learning in medical images focused either on the fundamentals of deep learning and the popular CNNs, or on applications in anatomical or structural image analysis such as detection, segmentation, and classification. However, functional nuclear imaging modalities have so far rarely been considered in the deep learning domain by existing review articles.
In this paper, we review existing work on classifying functional nuclear medicine images using CNN-based techniques. Specifically, we first classify existing work into three categories according to the diseases of concern. For each category, we then present details about the research objectives, the CNN models adopted, and the main results obtained. Finally, we discuss research challenges and directions for developing technological solutions for the classification of functional nuclear medicine images based on CNN models. To the best of our knowledge, this is the first survey specifically on classifying functional nuclear images using CNN techniques. For work on CNN-based classification of structural and other functional medical images, please refer to [7, 19-22].
The rest of this article is organised as follows. An overview of functional nuclear imaging and CNN technique is presented in the next section. Existing work is categorised and analysed in Section 3. The research challenges and future directions are discussed in Section 4.

Overview of nuclear imaging and CNN
In this section, we first provide an overview and some background of nuclear imaging in terms of the main imaging modalities and the functional nuclear medicine image data format. Then the CNN technique will be elaborated.

Nuclear imaging
2.1.1 Nuclear imaging modalities: As mentioned previously, nuclear imaging is a type of non-invasive medical imaging. The main modalities of nuclear imaging include SPECT, PET, and their hybrid modalities with CT and MRI [4]. The 'image' acquired by SPECT and PET imaging is stored in a file that is, in essence, a matrix. The elements of this matrix are radiation dosages, which differs from natural images, whose pixel values reflect brightness and range from 0 to 255. Furthermore, the resolution of a functional nuclear image is relatively low; a whole-body bone scan image, for example, has a size of 256 (width) × 1024 (height). It is thus more difficult to construct comparable, high-performance classifiers for disease classification than with structural and other functional medical imaging techniques [3]. Table 1 provides an overview of nuclear functional image modalities in terms of scanning time, resolution, price, and data quality, according to the findings in [1,2,4].
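Because the matrix elements are radiation dosages rather than 0-255 brightness values, a classifier typically needs the data rescaled before training. A minimal stdlib sketch of one common choice, min-max normalisation to [0, 1] (the function name and the particular scheme are our own illustrative assumptions, not mandated by any modality standard):

```python
def normalise_dosage(matrix):
    """Min-max normalise a radiation-dosage matrix to [0, 1] so it can be
    fed to a CNN alongside conventional 0-255 images.
    `matrix` is a nested list of numbers (height x width)."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant image: map everything to 0
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]
```

For example, a 2 × 2 dosage matrix `[[0.0, 5.0], [10.0, 2.5]]` maps to `[[0.0, 0.5], [1.0, 0.25]]`.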
SPECT, one of the most well-established nuclear functional imaging modalities, has been widely used since the early 1990s. Over 18 million SPECT scans are conducted each year in the United States (MEDraysintell Nuclear Medicine Edition 2017, http://www.medraysintell.com/). A SPECT gamma camera costs from $0.4 million to $0.6 million (https://www.dicardiology.com/article/spect-scanner-vs-pet-which-best). Owing to the low resolution and long scan time, it can be unclear whether areas of enhancement are tumours or inflammation, and SPECT images are prone to both artefacts and attenuation; some artefacts may therefore be misidentified as perfusion defects. In addition, SPECT cannot estimate blood flow. These issues have now been partially resolved by technological progress, for example the use of visual tracking systems to monitor patient movement during long scans, improved cameras, computer-aided image enhancement, and cutting scan times with triple-headed cameras.
PET is another nuclear functional imaging modality, used for the quantification of radioactivity in vivo. Combining nuclear medicine and biochemical analysis, PET is used mostly for patients with brain or heart conditions and cancer. PET is insensitive to artefacts and provides a quantitative estimate of blood flow. However, PET equipment is more expensive than SPECT.
Although the SPECT and PET imaging modalities are invaluable, the quality of the data they acquire is either poor or noisy, with limited spatial resolution. For this reason, SPECT and PET scans are often combined with CT and MRI imaging, allowing the correlation of functional and structural imaging (called hybrid imaging).
As mentioned above, a medical image can include both functional and structural information. MRI and CT images capture anatomical and structural information with high spatial resolution. SPECT and PET images, on the other hand, provide functional information with low spatial resolution. By combining PET or SPECT with MRI images, a more informative image can thereby be obtained with both structural and functional information [23]. As such, a hybrid image is very useful for the non-invasive diagnosis of diseases [23]. SPECT-CT and PET-CT imaging can provide anatomic signposts to clinicians so that the affected tissue in a nuclear image is accurately located and identified. CT images can also help correct for attenuation. These hybrid machines are very expensive, however; one PET-CT scanner can cost around $2 million (http://ais-nuclear.com/). Currently, only a few companies, including GE, Philips, and Siemens, manufacture PET-CT and SPECT-CT machines due to their high cost. As a nuclear imaging modality, SPECT-CT is still in its infancy (http://ais-nuclear.com/).
The increasing use of functional nuclear imaging systems accelerates the growth of the global nuclear imaging equipment market. Analysts predict that the compounded annual growth rate of the nuclear imaging equipment market will be over 6% by 2023 (Global Nuclear Imaging Equipment Market 2019-2023, https://www.reportlinker.com/p05775119/Global-Nuclear-Imaging-Equipment-Market.html).
Note that the hybrid imaging-based works are also reviewed in the following.
2.1.2 Nuclear imaging data format: Image data captured by nuclear imaging equipment are often stored in a digital imaging and communications in medicine (DICOM) file, according to the DICOM standard (https://www.dicomstandard.org/). A DICOM file has a file header and a data set. Specifically, the header comprises a 128-byte file preamble followed by a 4-byte DICOM prefix. As an instance of an information object, a data set consists of data elements, each identified by a data element tag. Standard DICOM data elements have even-numbered groups, while private data elements use odd-numbered groups to encapsulate specific information. However, it is recommended to use existing elements whenever possible (http://www.web3.lu/dicom-standard/).
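The layout described above (128-byte preamble, 'DICM' prefix, tagged data elements with even/odd group numbers) can be sketched with the standard library alone. The function names are our own; a real application would use a dedicated DICOM library rather than this minimal check:

```python
import struct

def has_dicom_prefix(data: bytes) -> bool:
    """Check whether a byte buffer follows the DICOM file layout:
    a 128-byte preamble followed by the 4-byte prefix b'DICM'."""
    return len(data) >= 132 and data[128:132] == b"DICM"

def parse_element_tag(data: bytes, offset: int):
    """Read one data element tag (group, element), each a 16-bit
    little-endian unsigned integer, starting at `offset`."""
    group, element = struct.unpack_from("<HH", data, offset)
    return group, element

def is_private_tag(group: int) -> bool:
    """Private data elements use odd-numbered groups;
    standard elements use even-numbered groups."""
    return group % 2 == 1
```

For instance, `has_dicom_prefix(b"\x00" * 128 + b"DICM")` is `True`, and the tag bytes `08 00 16 00` decode to group 0x0008, element 0x0016.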

Convolutional neural networks
As a newer field of machine learning, deep learning has achieved tremendous success in a variety of applications over recent years. Through multiple processing layers, deep learning learns representations of data at multiple levels of abstraction [24]. Currently, most deep architectures rely on neural networks consisting of several layers of stacked neurons trained through backpropagation. A network with many layers is often called 'deep', or a deep neural network (DNN) [25].
In popular deep architectures, stacked auto-encoders and deep belief networks [26,27] are trained in a layer-by-layer way. In contrast, CNNs [28] and recurrent neural networks (RNNs) [29], including long short-term memory (LSTM) [30] and gated recurrent units (GRUs) [31], are trained in a supervised fashion. The significantly reduced number of weights and the translational invariance of learned features enable CNNs to be trained end-to-end. RNNs also exploit structure in the data, similar to the weight sharing in CNNs but over a sequential structure instead.
As a type of biologically-inspired deep model resembling the human visual system, CNNs extract image features at different abstraction levels by using convolution operators, which are applied to the responses of the previous layer [32]. The CNN structure was originally proposed by Fukushima [33] in 1988, and the first successful application of a CNN was the handwritten digit recognition by LeCun et al. [34] in 1989 with a gradient-based learning algorithm. Since then, further improvements of CNNs have been achieved, with state-of-the-art results reported for a variety of image recognition, segmentation, detection, and retrieval tasks [35]. Currently, DNNs are a common choice in computer vision, and the most successful models for image analysis are CNNs [7].
As mentioned above, CNNs have become far more ubiquitous in (medical) image analysis, thanks to weight sharing, which exploits the intuition that similar structures occur in different locations of an image. CNNs are specifically designed for images, introducing position invariance to pattern recognition. As a result, CNNs have been demonstrated to classify objects regardless of their size, orientation, angle, etc. [36]. A CNN-based CAD system can automatically learn features from images in an optimal way, transforming inputs into outputs (e.g. disease present/absent). This makes CNN-based CAD systems more popular than those built on traditional machine learning techniques, where hand-crafted features are extracted by human researchers.
In the following section, we provide a brief introduction of CNN architectures and the main CNN classification models.

CNN architecture:
Since the first successful application of CNNs to handwritten digit recognition [34], many innovations in CNN architecture have been achieved, including parameter optimisation, regularisation, structural reformulation, etc. [37]. Using a group of images from the open ImageNet database [38], Fig. 1 depicts the overall architecture of CNNs, consisting of four main types of layers: convolutional layer (Conv), non-linear processing or activation function layer (Act), subsampling or pooling layer (Pool), and fully connected or dense layer (FC). Each layer in this deep architecture performs transformations by leveraging a number of convolutional kernels called filters.
The alternate layers of the convolutional, activation function, and pooling layers perform a feature extraction task, while the fully-connected layer and SoftMax function are responsible for the final classification or regression task.
Feature extraction: At this stage, each layer of the CNN accepts as input the output of its immediate previous layer and passes its own output as the input to the next layer. The outputs of the convolutional and pooling layers are grouped into 2D planes called feature maps.
Classification: At this stage, the extracted features (feature maps) form the input to the final fully connected network, whose weight matrix dimensions are determined accordingly.
In the following, we provide a brief introduction to the basic CNN architecture.
Convolutional layer: This layer is a set of convolutional kernels, each associated with a receptive field (a small area of an image). A convolutional layer divides an image into receptive fields and convolves over them with a set of specific weights [39]. For a given input image I(x, y), the output of the k-th layer after a convolution with the l-th convolutional kernel is

F_l^k(x, y) = I(x, y) * K_l^k    (1)

where K_l^k is the l-th convolutional kernel of the k-th layer, * denotes the convolution operation, and x, y are spatial localities. Several important parameters govern a convolution operation. The first, the stride, determines the gap between two scanned regions as the convolutional kernel slides over a feature map from left to right. For example, the stride depicted in Fig. 2 is 1 for the 5 × 5 convolutional kernel. The second, padding, determines how many additional borders of 0 values are added around the original input image, e.g. to make the output have the same size as the input. For an input image of size in_x × in_y, let k_size be the size of a receptive field (i.e. a k_size × k_size convolutional kernel), and let s and p be the stride and padding. Then the output size out_x × out_y can be calculated as

out_x = (in_x − k_size + 2p)/s + 1,  out_y = (in_y − k_size + 2p)/s + 1    (2)

Non-linear processing layer: This layer brings non-linearity into the network by using an activation function, enabling a neural network to approximate arbitrarily complex functions [40]. Without the non-linearity of an activation function, by contrast, multiple layers of a neural network are equivalent to a single layer [41]. The input of a non-linear processing layer is the output of its immediate previous convolution layer.
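The output-size arithmetic of (2) and the sliding-window operation itself can be sketched in plain Python (a naive nested-loop reference, not an efficient implementation; the function names are our own):

```python
def conv_output_size(in_size: int, k_size: int, stride: int, padding: int) -> int:
    """Output size along one axis: out = (in - k_size + 2p) / s + 1."""
    return (in_size - k_size + 2 * padding) // stride + 1

def conv2d(image, kernel, stride=1, padding=0):
    """Naive 2D convolution (cross-correlation, as used in CNNs) on nested lists."""
    in_y, in_x = len(image), len(image[0])
    k = len(kernel)
    # Zero-pad the input with `padding` borders of 0 values.
    padded = [[0.0] * (in_x + 2 * padding) for _ in range(in_y + 2 * padding)]
    for y in range(in_y):
        for x in range(in_x):
            padded[y + padding][x + padding] = image[y][x]
    out_y = conv_output_size(in_y, k, stride, padding)
    out_x = conv_output_size(in_x, k, stride, padding)
    out = [[0.0] * out_x for _ in range(out_y)]
    for oy in range(out_y):
        for ox in range(out_x):
            s = 0.0
            for ky in range(k):
                for kx in range(k):
                    s += padded[oy * stride + ky][ox * stride + kx] * kernel[ky][kx]
            out[oy][ox] = s
    return out
```

For example, a 5 × 5 input with a 3 × 3 kernel, stride 1, and padding 1 yields a 5 × 5 output, illustrating how padding preserves the input size.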
Let C be the output of a convolution layer. The activation function φ for a non-linear processing layer can be written as

A = φ(C)    (3)

As depicted in Fig. 2, a rectified linear unit (ReLU) function sets elements with a value < 0 to 0. ReLU is among the most commonly used activation functions, defined as [42]

ReLU(x) = max(0, x)    (4)

where x is a pixel value in an image.
Other common activation functions are the sigmoid function, hyperbolic function, and several variants of ReLU.
A sigmoid function monotonically increases towards an asymptote at some finite value. Its mathematical representation is

sigmoid(x) = 1/(1 + e^(−x))    (5)

The hyperbolic tangent function (tanh) allows faster convergence, and its output is broader, with an average around zero. Equation (6) is its mathematical representation [34]:

tanh(x) = (e^x − e^(−x))/(e^x + e^(−x))    (6)

Unlike the ReLU function, the sigmoid and hyperbolic functions can suffer from the vanishing gradient problem: during backpropagation, the gradients of the activation function in neurons whose outputs are near the asymptotes are nearly 0, so the weights in these neurons are not updated, and the weights of neurons connected to them are also updated slowly.
In practice, there are multiple variants of the ReLU function for different purposes, mainly including parametric ReLU (PReLU) [43], leaky ReLU (LReLU) [44], and the exponential linear unit (ELU) [45]. They are defined as follows:

PReLU(x) = x if x > 0, αx otherwise    (7)
LReLU(x) = x if x > 0, αx otherwise    (8)
ELU(x) = x if x > 0, α(e^x − 1) otherwise    (9)

where α in (7)-(9) is a constant (in PReLU, α is learned during training, whereas in LReLU it is fixed). It is worth noting that PReLU degrades into ReLU if α = 0, and into LReLU if α is a very small constant (e.g. 0.01). The ReLU function and its variants are preferred because they avoid the vanishing gradient [46]. Fig. 3 shows graphical representations of several typical activation functions.
Subsampling layer: This layer captures similar information from the neighbourhood of a receptive field by outputting a dominant response within this local region. In this way, the input becomes invariant to geometrical distortions [28,47]. The main pooling formulations are max and average pooling. Specifically, max (average) pooling partitions an input image into a set of sub-regions and outputs the maximum (average) value for each sub-region.
For the example in Fig. 2, the output of the first 3 × 3 sub-region (coloured bright blue) after max pooling is 9.75; it would be 1.75, i.e. (5.99 + 9.75)/9, if average pooling were used.
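The activation functions of (4)-(9) and the pooling operation can be sketched as follows. The pooling usage mirrors the Fig. 2 example under the assumption that the remaining seven elements of the sub-region are 0; all function names are our own:

```python
import math

# Activation functions of Eqs. (4)-(9); alpha is a small constant
# (learned during training in the case of PReLU).
def relu(x):              return max(0.0, x)                          # (4)
def sigmoid(x):           return 1.0 / (1.0 + math.exp(-x))           # (5)
def tanh(x):              return math.tanh(x)                         # (6)
def prelu(x, alpha):      return x if x > 0 else alpha * x            # (7)
def lrelu(x, alpha=0.01): return x if x > 0 else alpha * x            # (8)
def elu(x, alpha=1.0):    return x if x > 0 else alpha * (math.exp(x) - 1.0)  # (9)

def pool2d(image, size, mode="max"):
    """Partition `image` into non-overlapping size x size sub-regions
    and output the maximum (or average) value of each sub-region."""
    out = []
    for y in range(0, len(image) - size + 1, size):
        row = []
        for x in range(0, len(image[0]) - size + 1, size):
            window = [image[y + dy][x + dx]
                      for dy in range(size) for dx in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out
```

For a 3 × 3 sub-region containing 5.99 and 9.75 with the other elements 0, max pooling returns 9.75 and average pooling (5.99 + 9.75)/9 ≈ 1.75. Note also that `prelu(x, 0)` equals `relu(x)`, matching the degeneration into ReLU mentioned above.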
Fully connected layer: This layer makes a non-linear combination of selected features at the end of the network. As a global operation, a fully connected layer takes its input from the previous layer and globally analyses the outputs of all preceding layers [32,48]. The output nodes of the network normally use the SoftMax function for n unordered classes.
The SoftMax function is defined as

f(x_j) = e^(x_j) / Σ_{i=1}^{n} e^(x_i)

where f(x_j) is the score of the j-th output node, x_j is the net input to the j-th output node, and n is the number of output nodes. All of the output values f(x_j) are probabilities between 0 and 1, and their sum is 1 [49]. Intuitively, if the value x_j is larger than the others, then its component approximates 1. If n = 2, the classification model reduces to a logistic regression model. The SoftMax function is plotted in Fig. 3.
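The SoftMax computation can be sketched as follows (the shift by the maximum is a standard numerical-stability trick, not part of the definition above, and leaves the result unchanged):

```python
import math

def softmax(xs):
    """SoftMax over the net inputs of the n output nodes.
    Subtracting max(xs) avoids overflow in exp() without
    changing the resulting probabilities."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs always sum to 1, and for n = 2 the first component equals sigmoid(x_1 − x_2), which is why the two-class case reduces to logistic regression.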
In addition, different mechanisms such as batch normalisation and dropout are also used to improve CNN performance [39].

CNN classification models:
There have been remarkable innovations in CNN architectures over the past two decades; the most significant improvements in CNN performance have resulted from designing new blocks and restructuring processing units. CNN-based applications became prevalent after the striking performance of AlexNet [50] on the ImageNet dataset [38].
As we know, a classification algorithm makes use of a function f mapping an input variable x to a categorical output variable y; in other words, y = f(x) is the category predicted by the mapping function. Given one or several input variables, a classification model predicts the probabilities of two or more categories.
Several review articles have focused on classifying the architectures of CNNs [37,51]. A taxonomy of deep CNN architectures has been presented in [37], which includes seven different classes (see Fig. 4).
Spatial exploitation-based CNNs: Different filter sizes reflect different abstraction levels of information granularity. Usually, small filters extract fine-grained information while large ones capture coarse-grained information. A CNN can perform well on both fine- and coarse-grained details by adjusting its filter sizes. This category of CNN models includes LeNet [34], AlexNet [50], ZfNet [52], VGG [53], and GoogLeNet [54].
Depth-based CNNs: These rely on the assumption that, as the network depth increases, a CNN with several non-linear mappings can better approximate a target function and learn improved feature representations. The network depth therefore plays an important role in performance. Depth-based models include highway networks [55], ResNet [56], Inception-V3/V4 [57,58], and ResNeXt [59].
Multi-path-based CNNs: The concept of multi-path connectivity has been proposed for training deeper networks. Connections along multiple paths are used to connect one layer to another directly by skipping some intermediate layers, allowing information to flow across the layers. Different types of shortcut connections have been developed, resulting in diverse multi-path-based models such as highway networks [55], ResNet [56], and DenseNet [60].
Width-based multi-connection CNNs: In the taxonomy of [37], a further class exploits network width rather than depth, increasing the number of feature maps or parallel connections per layer to improve representational power (e.g. Wide ResNet and Xception).
Feature map exploitation-based CNNs: The success of a CNN results mainly from its capacity for hierarchical learning and automatic feature extraction. Selecting class-specific feature maps is critical for improving the generalisation of the network. Models exploiting feature maps include the squeeze-and-excitation network [65] and the competitive squeeze-and-excitation network [66].
Channel exploitation-based CNNs: The representation of the input is critical in determining the performance of a CNN; a lack of diversity or the absence of class-discriminative information in the input channels will degrade performance. The concept of auxiliary learners has therefore been introduced to enhance the input representation, giving channel exploitation-based CNNs such as the channel-boosted CNN using transfer learning [67].
Attention-based CNNs: Attention mechanisms have been incorporated into CNNs to improve representation, enabling a CNN to recognise objects even against cluttered backgrounds. Several attention-based models have been developed, including the residual attention neural network [68], the convolutional block attention module [69], and the concurrent spatial and channel excitation mechanism [70].

Hyper-parameters of deep models:
A DNN has several hyper-parameters ranging from learning rate, pruning rate, number of network layers, batch size, number of nodes per layer, momentum factor, number of training rounds, and learning rate decay rate to regularisation coefficient. In particular, the number of neurons, the number of hidden layers, dropout regularisation, and activation function are the most critical hyper-parameters [71,72]. In the following, we list the findings [72] on how hyper-parameters affect the performance of DNNs.
• The ReLU activation function is commonly used, since it performs better than sigmoid or tanh.
• The network should contain at least two or three hidden layers for better performance.
• The number of neurons in each hidden layer depends on the dataset used and should be optimised. At least 100 neurons per hidden layer should be used; the best results are often obtained with larger numbers, such as 600, 1000, 2000, or higher.
• Dropout regularisation should be applied to both input and hidden layers, since dropout strongly affects DNN performance by preventing co-adaptation of neurons.
• For CNN models, no pre-training is performed; the weights and biases are randomly initialised from a Gaussian distribution.
• Given local computing capability, the number of epochs should be as high as possible, at least 300.
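The findings above can be collected into a starting-point configuration. A hedged sketch only: the dictionary keys and the specific dropout rates and neuron count are our own illustrative assumptions, not values prescribed by [72]:

```python
# A baseline hyper-parameter configuration reflecting the findings above.
# Values are illustrative starting points to be tuned per dataset.
baseline_hyperparams = {
    "activation": "relu",          # performs better than sigmoid or tanh
    "hidden_layers": 3,            # at least two or three
    "neurons_per_layer": 600,      # >= 100; larger is often better, tune per dataset
    "dropout": {"input": 0.2, "hidden": 0.5},  # apply to input and hidden layers
    "weight_init": "gaussian",     # random initialisation, no pre-training
    "epochs": 300,                 # as high as compute allows, at least 300
}
```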
As a typical deep architecture, CNNs are likewise affected by these hyper-parameters in their performance on classifying medical images, especially functional nuclear images.

State-of-the-art
In this section, we present existing work by grouping them into several different categories according to the diseases of concern.

Neurodegenerative diseases (NDDs)
NDDs such as Alzheimer's disease (AD) and Parkinson's disease (PD) are characterised by the progressive loss of neuronal structure or function, often associated with neuronal death [73]. SPECT, PET, and hybrid modalities are widely used in diagnosing NDDs. Mild cognitive impairment (MCI) is the prodromal stage of AD; this stage is classified further into a progressive state and a stable state. Tables 2 and 3 review existing works on classifying AD and PD, respectively.

Alzheimer's disease:
Iizuka et al. [74] used deep learning-based image classification to diagnose dementia with Lewy bodies (DLB) and AD. Brain perfusion SPECT images were obtained from 80 patients with DLB and AD, as well as 80 individuals with normal cognition. After being trained on brain surface perfusion images, the CNN used gradient-weighted class activation mapping to visualise the features captured by the trained model.
Huang et al. [75] proposed a 3D CNN-based model for the diagnosis of AD, in which multi-modality information from both MRI and PET images of the hippocampal area was integrated. The proposed network was examined by training the classifier with paired MRI and PET images from the ADNI data set (http://adni.loni.usc.edu/). The selected datasets consist of 731 cognitively unimpaired subjects, 647 with AD, 441 with stable MCI (sMCI), and 326 with progressive MCI (pMCI). In particular, the authors claimed, based on their results, that segmentation is not a prerequisite for CNN classification and that combining data from the two modalities generates better results.
Focusing on diagnosing AD, especially on its early stage (e.g. pMCI and sMCI) in clinical practice, Feng et al. [76] proposed a simple 3D CNN architecture to obtain the deep feature representation of MRI and PET images from the ADNI dataset. In their recent work [77], a CNN-based deep learning framework was designed. In this framework, 3D CNN is used together with fully stacked bidirectional LSTM (FSBi-LSTM). In particular, hidden spatial information was extracted from deep feature maps to further improve the performance of FSBi-LSTM.
For classifying AD, Cheng and Liu [78] hierarchically learned the multi-level image features by constructing a cascaded 3D-CNNs from brain PET images in ADNI datasets with 100 normal controls (NCs) and 93 AD patients. Based on local patches of images, the multiple deep 3D-CNNs transformed an image into compact high-level features. A deep 3D CNN is then learned in a way that the high-level features are ensembled for the final classification.
Constructing cascaded CNNs for the classification of AD, Liu et al. [79] learned multimodal and multi-level features from PET and MRI brain images. Similar to the work in [78], multiple deep 3D CNNs for different local patches of an image first represented the brain image as more compact high-level features. These high-level features learned from the multiple modalities were then ensembled by an upper 2D CNN with a SoftMax layer, extracting the latent correlation features of corresponding patches. Finally, the learned features are combined with a fully connected layer before the SoftMax layer. The proposed method was evaluated on PET and MRI images of 397 subjects, with 204 MCI (76 pMCI + 128 sMCI), 100 NC, and 93 AD patients from the ADNI datasets. In their other work [80], 2D CNNs were used to capture the features of image slices, while GRUs learned and integrated the inter-slice features. The experimental evaluation was conducted on PET images of 339 subjects, with 146 MCI, 100 NC, and 93 AD patients from the ADNI dataset.
Choi and Jin [81] developed a 3D CNN-based automatic image interpretation system that can accurately predict the future cognitive decline of MCI patients by using PET images in the ADNI dataset of 139 patients with AD, 171 MCI, and 182 NC. A deep CNN was trained by using 3D PET images of NC and AD subjects.
Using a shearlet-based deep CNN, Jabason et al. [82] proposed a classification algorithm for discriminating patients with AD, early MCI, late MCI, and NC in PET images, which included 742 NC, 709 early MCI, 577 late MCI, and 177 AD subjects from the ADNI dataset. The shearlet transform was integrated into a conventional CNN to incorporate the multi-resolution details of the data. After the model was pre-trained by transforming inputs into a better-stacked representation, the resulting layer was fed into a SoftMax classifier, which then returned the probability of each class.
As a pre-defined scoring system, the score of brain amyloid plaque load (BAPL) can visually assess a subject's amyloid deposition in the brain by using 18F-florbetaben. BAPL 1 is considered the amyloid beta (Aβ)-negative status, while BAPL 2 and BAPL 3 indicate the Aβ-positive status. Kang et al. [83] designed a CNN model to predict the Aβ-positive and Aβ-negative status. The brain PET images were acquired from NC subjects and 176 patients with MCI and AD. The visual assessment of slice-based samples was achieved by a VGG-16 model.
Using 1272 PET and MRI images from the ADNI dataset, Vu et al. [84] proposed a deep learning method that fuses multiple modalities to diagnose AD and MCI. Both a sparse autoencoder (SAE) and a CNN are used to train and test on the combined PET and MRI data, taking advantage of the complementary information that multiple modalities provide over a single one. By varying the intensity and spatial normalisation applied in pre-processing, four different CNN models were built [86] to answer the question of how well CNNs can tell intensity differences from spatial differences when analysing nuclear brain imaging. The results showed that a sufficiently complex model, such as the 3D version of AlexNet, can effectively express spatial differences.
From the same research group [85], Ortiz et al. [87] classified positive and negative cases of PD by using 269 3D SPECT images from the PPMI dataset. Based on LeNet-5 and AlexNet, two architectures were built to identify isosurfaces, which connect voxels of a specified intensity or value, and to extract descriptive features from the images.
In summary, in the context of applying CNNs to classifying functional nuclear medicine images, AD has been more widely studied than PD. However, most existing studies use only the open datasets, i.e. the ADNI dataset for 11 of the 13 studies on AD and the PPMI dataset for all studies on PD. Thus, the limited number of datasets still challenges neurodegenerative disease image analysis. More details are reported in Tables 3 and 4.

Tumours and cancers
Differentiating images with physiological lesions from normal ones is another main line of research in functional nuclear medicine image analysis. In the existing work, the most frequently considered diseases in this category are thyroid disease, head and neck cancer, lung cancer, cardiac disease, glioma, lymphoma, and multiple myeloma. The literature in this category focuses on classifying or diagnosing nuclear medicine images of the diseases above based on various CNN models. We provide a complete comparison of these works in Table 4.

Thyroid disease:
Having been identified as the second-largest disease area in the endocrine field, thyroid disease occurs when the thyroid gland does not function correctly.
Using SPECT images, Ma et al. [88] developed a CNN-based CAD system for diagnosing thyroid diseases. The images were collected from subjects with four kinds of thyroid diseases: hyperthyroidism, hypothyroidism, methylene inflammation, and Hashimoto's disease. Cross-layer connections were built with channels in which trainable parameters were added to learn the feature weights of the previous layer. Dropout, mix-up, early stopping, and batch normalisation were employed to improve the performance of the CNN.
The same research group as in [88] has proposed a modified DenseNet architecture for the diagnosis of thyroid diseases [89]. Their new architecture is improved by adding trainable parameters into each of the skip connections. The learning rate during training is optimised by a flower pollination algorithm. A total of 2888 SPECT image samples were collected using a Siemens SPECT ECAM scanner at Heilongjiang Provincial Hospital, including 438 Hashimoto's disease, 780 Grave's disease, 860 normal, and 810 sub-acute thyroiditis samples.

Head and neck cancer:
As a broad term, head and neck cancer refers to a group of biologically similar cancers. They start in the lip, paranasal sinuses, oral cavity, nasal cavity, larynx, and pharynx. Lymph node metastasis (LNM) is a significant prognostic factor in patients with head and neck cancer.
For predicting LNM in head and neck cancer, Chen et al. [90] proposed a hybrid CNN-based model that combines many-objective radiomics (MO-radiomics) and a 3D CNN through evidential reasoning (ER). The proposed model can predict the three classes of lymph nodes: normal, suspicious, and involved. The 3D CNN consists of convolution, ReLU, max-pooling, and fully connected layers that learn both local and global features. The final output was produced by fusing the 3D CNN outputs and the MO-radiomics outputs through the ER approach. PET and CT images were taken from 31 patients with head and neck cancer who had participated in the trials. For all of the trial patients, both a nuclear medicine radiologist and a radiation oncologist reviewed their nodal status. These nodes were contoured on contrast-enhanced CT guided by PET. The model was trained on the lymph nodes of the first 21 patients, who had 39 suspicious nodes, 53 involved nodes, and 30 normal nodes. The prediction performance was then validated on the remaining 10 independent patients with 9 suspicious nodes, 17 normal nodes, and 13 involved nodes.

Cardiac disease:
Cardiac disease is a broad term describing disease conditions that affect the heart, which can be congenital or acquired. Myocardial SPECT is a widely used non-invasive method for detecting coronary artery disease.
Togo et al. [91] investigated whether or not deep CNN-based features can capture the differences between cardiac sarcoidosis (CS) and non-CS by using polar maps. A total of 85 patients (33 CS and 52 non-CS patients) participated in their experiments. One radiologist reviewed the CT and PET images, using the left ventricle region to construct polar maps. High-level features were extracted from these polar maps by using an Inception-v3 network. The performance was evaluated by applying these features to the CS classification. For comparison, both the standardised uptake value-based and the coefficient of variance-based classification methods were used.
By using graph-based CNNs (GCNNs), Spier et al. [92] developed approaches for the detection of coronary artery disease. They evaluated both disease detection and localisation performance on SPECT images, including 503 rest and 443 stress studies. Four architectures were evaluated for analysing stress and rest cases: a 1D fully connected neural network, a 2D CNN, a GCNN using Cayley filters, and a GCNN using Chebyshev polynomials. Disease detection was performed using the 17-segment model. Localisation performance was evaluated with the best model on 30 polar maps labelled per segment by an expert reader.
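The Chebyshev-polynomial variant evaluated in [92] follows the spectral graph convolution idea of approximating a filter by a truncated Chebyshev expansion of the rescaled graph Laplacian, so that filtering needs only sparse matrix-vector products. A minimal NumPy sketch of that filtering step, where the tiny path graph, filter order, and coefficients are illustrative assumptions rather than the configuration used in the cited study:

```python
import numpy as np

def chebyshev_filter(L, x, theta):
    """Apply a Chebyshev-polynomial spectral filter to a graph signal.

    L     : rescaled graph Laplacian, L_tilde = 2L/lmax - I (n x n)
    x     : signal on the n graph nodes (e.g. polar-map segments)
    theta : filter coefficients, one per polynomial order
    Uses the recurrence T_k(L)x = 2 L T_{k-1}(L)x - T_{k-2}(L)x.
    """
    t_prev, t_curr = x, L @ x            # T_0(L)x and T_1(L)x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2 * (L @ t_curr) - t_prev
        out = out + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return out

# Toy example: a 4-node path graph standing in for adjacent segments
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A                       # graph Laplacian
L_tilde = 2 * L / np.linalg.eigvalsh(L).max() - np.eye(4)
x = np.array([1.0, 0.0, 0.0, 0.0])
y = chebyshev_filter(L_tilde, x, [0.5, 0.3, 0.2])
print(y.shape)  # (4,)
```

A filter of order K mixes information only from nodes up to K hops away, which is what makes this family of GCNNs localised on the graph.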

Glioma:
As a type of tumour, glioma occurs in the brain and spinal cord, beginning in the gluey supportive cells (glial cells) surrounding nerve cells. By using a fully 3D U-Net deep network, Blanc-Durand et al. [93] demonstrated the performance of automated detection and segmentation of PET lesions. All of the dynamic PET brain images were realigned to the first dynamic acquisition, co-registered, and spatially normalised into the Montreal Neurological Institute template. Ground truth segmentations were obtained by manual delineation and thresholding. The volumetric CNN was implemented by using a U-Net library with three layers in the encoding and decoding paths.
Toyonaga et al. [94] applied the CNN technique to predicting tumour hypoxia from common imaging modalities, i.e. PET and MRI [fluorodeoxyglucose (FDG)-PET rather than fluoromisonidazole (FMISO)-PET, which is available only in limited institutions]. A total of 32 glioblastoma patients who underwent MRI, FDG-PET, and FMISO-PET before surgical intervention were reviewed in their study. Seven image series of FDG-PET, FMISO-PET, gadolinium-enhanced T1-weighted images (T1WI), non-enhanced T1WI, fluid-attenuated inversion recovery, T2WI, and DWI were analysed after SPM12 co-registered all the images to the individual T1WI. An AlexNet-based model was built for the automated classification of hypoxia. A randomly selected four-fifths of the square regions were used for training and validation, while the remaining one-fifth were used for testing.

Lung cancer:
Lung cancer refers to cancer occurring in the lungs, with two main types: small cell lung cancer (SCLC) and non-SCLC (NSCLC). The latter is the most common type of lung cancer.
Kirienko et al. [95] presented a CNN-based model that classifies lung cancer lesions as T1-T2 or T3-T4 from staging FDG-PET/CT images. They included 472 patients in their study (353 patients for T1-T2 and 119 patients for T3-T4). The staging was pathological in 375 cases and clinical in 97 cases. A bounding box cropped around the lesion centre on both CT and PET images was used as the input to the CNNs. The classified results were either correct or incorrect. Based on their study, the authors claimed that CNNs can assist in the staging of patients affected by lung cancer.

Lymphoma:
Lymphoma is a cancer of the lymphatic system that can affect all areas and other organs throughout the body.
To improve the sensitivity of image interpretation and lesion detection, it is important to identify the sites of normal FDG excretion and physiologic uptake (sFEPU). To represent the inter-class differences between sFEPU fragments and their inconsistent localisation information, Bi et al. [96] proposed a CNN-based method that uses multi-scale superpixel-based encoding to group individual sFEPU fragments into larger regions. Their method can extract highly discriminative features of images via domain-transferred CNNs. A VGG-based network trained on ImageNet was built as a feature extractor that encodes PET superpixels into a 96-dimensional feature vector.
For classifying mediastinal LNM of NSCLC, Wang et al. [97] compared the state-of-the-art deep learning method with four classical machine learning methods on CT and FDG-PET images. The four classical methods are artificial neural networks, adaptive boosting, random forests, and support vector machines. In their study, the deep learning method refers to a CNN. The five methods were compared experimentally by using PET and CT images of 168 patients with a total of 1397 lymph nodes. They concluded that a CNN is more objective and more convenient than the classical methods, because a CNN needs neither feature calculation nor tumour segmentation. A CNN, however, does not use important diagnostic features, which have been proved more discriminative than texture features for classifying small-sized lymph nodes. Thus, incorporating diagnostic features into a CNN is a promising direction for future research in this area.

Multiple myeloma:
As a type of blood cancer, multiple myeloma affects plasma cells. In multiple myeloma, malignant plasma cells accumulate in the bone marrow, crowding out the normal plasma cells. The bone lesions must be assessed for the therapeutic and diagnostic planning of multiple myeloma. In addition to anatomical changes, Ga-68-Pentixafor PET-CT can capture the abnormal molecular expression of the chemokine receptor CXCR4.
Xu et al. [98] adopted cascaded CNNs to form a W-shaped deep architecture (W-Net) that takes advantage of multimodal information for detecting lesions. The first part of W-Net extracts skeletons from a CT scan, while the second part detects and segments lesions. With three-fold cross-validation, the network was tested on 1268 Ga-68-Pentixafor PET and CT scans of multiple myeloma patients. The results showed that W-Net can learn features from multimodal images for detecting multiple myeloma bone lesions. This preliminary study encouraged the authors to further develop the deep learning approach for multiple myeloma lesion detection with a greater number of subjects.
Although some work has been done on separating physiological lesions from normal tissue, combining CNNs with functional nuclear imaging for the diagnosis of physiological diseases is still in its infancy. So far, there are only a few works for each disease category.

Others
The work in this category aims to improve the quality of nuclear medicine images or to predict the standardised uptake value of radionuclides by leveraging CNN techniques, rather than to classify the images.

Avoiding patient misidentification:
For classifying patients by sex, Kawauchi et al. [99] developed a simple CNN-based system from FDG-PET and CT images. These images were from 6462 consecutive patients who underwent whole-body CT and FDG-PET. Seventy per cent of the randomly selected images were used for training and validation, while the remaining 30% were used for testing. For the test images, the sex of 99% of patients was correctly categorised. An image-masking simulation was then performed to identify the body parts that are significant for patient classification. As a result, the pelvic region was identified as the most important feature through the simulation. Their findings demonstrate that a CNN-based system is effective in predicting the sex of patients, with or without age and body weight predictions, thereby helping to prevent patient misidentification in clinical settings.

Predicting maximum standardised uptake value (SUVmax):
By using CT images from a PET-CT examination, Shaish et al. [100] examined whether a CNN can predict the SUVmax of lymph nodes in patients with cancer. For their experiments, consecutive initial staging PET-CT scans were collected from patients with pathologically proven malignancy. From the unenhanced CT portion of each PET-CT examination, two blinded radiologists selected 1 to 10 lymph nodes and recorded their SUVmax. The inputs to a novel 3D CNN were the cropped lymph nodes and the primary tumour histology type, while the output was the predicted SUVmax. Two separate cohorts were used for training and testing the CNN. Defining an SUVmax of 2.5 or greater as FDG avid, two blinded radiologists classified lymph nodes as either FDG avid or non-FDG avid from the unenhanced CT images. This study also performed a logistic regression analysis. A total of 400 lymph nodes in 136 patients were used for training, while 164 lymph nodes in 49 patients were used for testing. The performance would be improved if combined with the radiologists' qualitative assessment. Based on the tumour histology subtype and the unenhanced CT images, a CNN can predict the SUVmax of lymph nodes with moderate accuracy.

Increasing the image quality:
To address the problem of coarse, blurred sinograms with large parallax errors associated with large crystals, Hong et al. [101] proposed a single-image super-resolution method that enhances the resolution and noise characteristics of PET images by leveraging a deep residual CNN with a dedicated network architecture, thereby making PET imaging more efficient. Relying on transfer learning, their approach deals well with cases of poor labelling and small training sets. Its performance was validated against simulated data, Monte Carlo simulated data, as well as preclinical data. In addition, this approach uses external PET data as prior knowledge for training without requiring additional information during inference. Meanwhile, it can be seamlessly integrated into the normal PET imaging framework. As such, it potentially finds applications in designing low-cost and high-performance PET systems.

Obtaining high-quality PET images requires a standard dose of radioactive tracer, which carries the risk of radiation exposure damage. To maintain the high quality of PET images while reducing the patient's exposure to radiation, Xiang et al. [102] proposed a deep learning architecture that can estimate a high-quality standard-dose PET (SPET) image from the combination of an accompanying T1-weighted MRI acquisition and a low-quality low-dose PET (LPET) image. Using the two-channel inputs of T1 and LPET, they adapted a CNN to directly learn an end-to-end mapping between the inputs and the SPET output. By using the auto-context strategy, multiple CNN modules are then integrated in such a way that an SPET estimate from one CNN can be refined iteratively by subsequent CNNs. Validation on real human brain PET-MRI data showed that the proposed method can produce high-quality estimates of PET images.
Using a 3D residual CNN, Song et al. [103] proposed a denoising approach for low-dose SPECT images. For mapping low-dose images into their corresponding standard-dose ones, the proposed CNN was trained with clinical acquisitions. The approach was validated against a set of 119 clinical acquisitions with the imaging dose reduced by a factor of four. The results showed that it can effectively suppress the noise level in the reconstructed myocardium.

The authors of [104] developed a deep CNN-based approach to estimate time-of-flight directly from a pair of digitised detector signals. Their experimental setup used two photomultiplier tube-based scintillation detectors. The experimental results demonstrated that the CNN-based time-of-flight estimation improves timing resolution by 20% over leading-edge discrimination, as well as by 23% over constant fraction discrimination. A comparison of several different CNN architectures further showed that the CNN depth had the largest impact on timing resolution, whereas the network parameters, such as the convolutional filter size and the number of feature maps, had only a minor influence.
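Residual CNNs such as the one in [103] are commonly trained to predict the noise component of a degraded image rather than the clean image itself, with the prediction subtracted from the input at inference time. A schematic NumPy illustration of that target construction, where the arrays and the additive noise model are toy assumptions rather than the cited study's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a standard-dose and a low-dose SPECT slice
standard_dose = rng.random((64, 64))
noise = 0.1 * rng.standard_normal((64, 64))
low_dose = standard_dose + noise

# Residual learning: the network's training target is the residual
# between the degraded input and the clean reference ...
residual_target = low_dose - standard_dose

# ... and at inference, denoised = input - predicted residual.
# With a perfect residual prediction the clean image is recovered:
denoised = low_dose - residual_target
assert np.allclose(denoised, standard_dose)
```

Learning the (typically small-magnitude, zero-centred) residual instead of the full image is often easier to optimise, which is one motivation for residual architectures in denoising.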

Research challenges and directions
In this section, we point out the research challenges, and corresponding directions, in classifying functional nuclear medicine images with CNNs.

Lack of labelled image samples
Limited datasets greatly challenge functional nuclear medicine image analysis. It is often difficult to build big medical image datasets because of the rarity of diseases and patient privacy. In particular, medical experts are required for manually labelling and processing medical images, which is very expensive and needs huge effort [105].
The standard use of nuclear imaging has become widely accepted. In particular, SPECT, PET, and the hybrid modalities of SPECT-CT, SPECT-MRI, PET-CT, and PET-MRI are increasing in clinical use. The available open databases of nuclear medicine images are, however, still insufficient. Differing from natural images, medical images are often collected in the opportunistic process of clinical examination, which cannot follow an experiment-specific procedure. Moreover, there is a relatively low prevalence of nuclear imaging scans in hospitals, limited by the high cost of nuclear imaging equipment.
Another main challenge is that only a limited number of labelled samples is available for training CNN models [106]. It is highly time-consuming and labour-intensive to annotate and label medical images. The lack of sufficient training data is especially challenging in nuclear medicine image analysis. To sum up, nuclear image acquisition is difficult, while quality annotation is costly.
Data augmentation and transfer learning are two lines of current solutions in the literature to address the above issues:
• Data augmentation: data augmentation is used to bake translational invariances (e.g. lighting, occlusion, viewpoint, scale, and background) into the dataset, so that the performance of the resulting models can be improved. The main data augmentation techniques are [105] geometric transformations, kernel filters, random erasing, colour space transformations, adversarial training, mixing images, feature space augmentation, generative adversarial network-based augmentation, and neural style transfer. Fig. 5 shows a taxonomy of image data augmentations [105].
• Transfer learning: as a popular approach, transfer learning [107, 108] is suitable for improving the efficiency of deep network learning [109]. Using transfer learning, a CNN previously trained on a big dataset (e.g. ImageNet [38]) for a related task can be fine-tuned by using the training examples of a new classification task. Fine-tuning involves learning the weights using the same learning algorithm. In terms of the numbers of both training epochs and training examples needed, a CNN can achieve more efficient learning by initialising its weights with those of an already learned CNN rather than with random weights [38]. In particular, the weights of the convolutional layers are typically copied rather than those of the entire network including the fully connected layers, because many images share low-level spatial characteristics that are better learned with big data [105]. Therefore, transfer learning can be adopted to prevent the overfitting that may be caused by limited nuclear image datasets, by fine-tuning CNNs pre-trained on natural image datasets to medical images. For example, the transfer learning technique has been used in [89, 96].
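As a concrete illustration of the geometric-transformation branch of the augmentation taxonomy, several label-preserving variants of a single scan can be generated with NumPy alone. The sketch below uses a toy 2D array as a stand-in for a PET slice; note that flips are only label-preserving for tasks where laterality does not matter, which is a task-specific assumption:

```python
import numpy as np

def augment_geometric(image, rng):
    """Return a randomly flipped and 90-degree-rotated copy of a 2D slice.

    Flips and right-angle rotations rearrange voxels without altering
    their values; finer rotations, scaling, and intensity shifts would
    be added in the same style.
    """
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                      # horizontal flip
    if rng.random() < 0.5:
        out = np.flipud(out)                      # vertical flip
    out = np.rot90(out, k=rng.integers(0, 4))     # random 90-degree rotation
    return out

rng = np.random.default_rng(42)
slice_2d = np.arange(16.0).reshape(4, 4)          # toy "PET slice"
augmented = [augment_geometric(slice_2d, rng) for _ in range(8)]
# Every augmented copy keeps the same voxel values, only rearranged
assert all(np.isclose(a.sum(), slice_2d.sum()) for a in augmented)
```

In practice such transforms are applied on the fly during training, so each epoch sees a different randomised view of the same limited dataset.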
Other techniques that help successfully apply CNNs to classify medical images are as follows [110]: training the CNN from scratch; performing unsupervised pre-training on natural or medical images; and fine-tuning CNNs or other types of deep learning models on the medical target images. The approaches mentioned above are not mutually exclusive; on the contrary, they can be combined into an integrated and more effective framework.

Imbalance of image samples
We use imbalanced data to refer to a scenario in which the number of instances of one class is scanty in comparison with other classes. Imbalanced data may cause classical classifiers to neglect minority-class instances and emphasise the majority class, resulting in a skewed classification accuracy [111]. In other words, classifiers on imbalanced data are more sensitive to detecting the majority class and less sensitive to the minority class [112]. The problem of imbalanced datasets is frequently present in healthcare applications, in which at least one class often constitutes only a very small minority of the data [113].
The distribution of medical images in nuclear medicine depends on the patients and their types of disease. The PET and SPECT images collected in the process of clinical examination may have an imbalanced distribution, unlike natural image datasets such as ImageNet [38], MNIST [34], COCO [114], and PASCAL Visual Object Classes [115]. Imbalanced training samples mean that a large number of examples is available for one disease class while only a few are available for others. As a result, classification accuracy may deteriorate, particularly for those patterns belonging to the less represented classes [116].
IET Image Process., 2020, Vol. 14 Iss. 14
For conventional machine learning tasks, a lot of work has been done on alleviating the class imbalance problem [111]. Data-level resampling is one of the many ways to handle it. Cost-sensitive learning techniques use a cost matrix for different types of errors to aid learning from imbalanced datasets, and kernel-based methods have also been used to tackle the class imbalance problem. The effects of imbalance on CNNs have been systematically studied [117], resulting in two categories of methods for addressing class imbalance, i.e. data-level methods and classifier-level methods. Oversampling is the best-suited resampling technique and does not lead to overfitting in CNNs [117].
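The two families of remedies, data-level resampling and cost-sensitive weighting, can be sketched in a few lines. Below, random oversampling duplicates minority-class examples until the classes are balanced, while inverse-frequency class weights would scale each example's loss in a cost-sensitive classifier; the labels and counts are illustrative toy data:

```python
import numpy as np

def oversample(features, labels, rng):
    """Randomly oversample minority classes up to the majority-class count."""
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(labels == c)
        idx.extend(members)                               # keep originals
        idx.extend(rng.choice(members, size=target - n))  # add duplicates
    idx = np.array(idx)
    return features[idx], labels[idx]

def class_weights(labels):
    """Inverse-frequency weights, as used in a cost-sensitive loss."""
    classes, counts = np.unique(labels, return_counts=True)
    return {c: len(labels) / (len(classes) * n) for c, n in zip(classes, counts)}

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(10, 1)
y = np.array([0] * 8 + [1] * 2)          # 8 "normal" vs 2 "disease" samples
Xb, yb = oversample(X, y, rng)
assert (yb == 0).sum() == (yb == 1).sum() == 8
print(class_weights(y))                  # minority class gets the larger weight
```

Oversampling changes what the model sees, while class weighting changes how errors are penalised; the systematic study in [117] found the former best suited to CNNs.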
However, accurately classifying nuclear images with imbalanced training samples is still a challenging issue in functional nuclear medicine image analysis.

Raw images accompanied by artefacts and pitfalls
Radiopharmaceuticals are widely accepted as a safe class of drugs with few adverse reactions and unexpected biodistributions. However, they can still be problematic because of technical issues in manufacture or reconstitution, spatial misalignment of the SPECT and CT modalities, the choice of CT acquisition protocols, drug administration, data corrections, and patient preparation [118, 119]. It is reported that SPECT-CT imaging is often accompanied by misregistration, respiration artefacts, truncation, highly attenuating foreign bodies, CT noise, and thick CT slices due to different causes [120]. All these artefacts and pitfalls challenge medical image analysis to some extent.
Motion artefacts, as one type of artefact, are common in nuclear images. A PET-CT scan takes about 30 min. During the scan, a patient may fall asleep while the system acquires one bed position; when the bed shifts to the next scanning position, the patient is often startled, and the resulting image is marred by motion artefacts. A small change in a patient's position or orientation between a SPECT scan and a CT scan can result in misregistration.
If a bone scan is performed right after a treatment, it will be difficult to distinguish tumour progression from a flare response [121, 122]. A flare response, a well-known pitfall, may last for as long as 6 months after therapy [121, 122]. In addition, if a patient has undergone recent surgery, such as a knee replacement, radionuclide bone scintigraphy will produce false-positive results.
It is important to recognise normal variants as they can mimic pathology, because the pattern of tracer uptake in the sternum, head, and neck region is variable. Normal variants have increased tracer uptake at the occipital protuberance, at the confluence of sutures, and at the pterion in the skull [123-125]. In addition, it is tedious and error-prone to detect dozens of lesions over the whole body on hybrid imaging.

Imperfect CNN models
Although DNNs have garnered tremendous success in a variety of applications, CNNs are far from perfect at present. There are still some challenges associated with the use of deep CNN architectures for machine learning tasks, some of which are given below [37].
First, like a black box, CNN models generally lack interpretability and explainability [19]. For vision-related tasks, CNNs may provide little robustness against noise and other alterations to images.
Second, a CNN can automatically extract problem-specific features related to a given task. For some tasks, however, it is necessary to know the nature of the extracted features before the classification process. For medical image analysis, a CNN does not use the important diagnostic features that have been proved more discriminative. Thus, incorporating diagnostic features into a CNN is a promising direction for future research, and the feature visualisation technique in CNNs can help in this direction.
Third, current research efforts on deep CNNs focus mainly on supervised learning; however, large annotated datasets are far from abundant for training supervised CNN models. Fourth, CNN performance is tightly related to parameter selection: even small changes in the selected parameters can affect the overall performance of a CNN. Therefore, the major issue of careful parameter selection needs to be addressed.
Finally, the efficient training of CNNs relies on powerful hardware such as graphics processing units. However, there is still much room for investigating how to efficiently apply CNNs in embedded and smart devices.

Conclusions
In this paper, we have presented a comprehensive review of the existing literature on the classification of nuclear medicine functional images, which reflect functional changes in pathological tissues, by using CNN techniques. After the introduction of nuclear imaging and CNN techniques, covering nuclear imaging modalities, nuclear image data formats, CNN architecture, and the main CNN classification models, an overview of existing work has been elaborated by classifying it into three different categories according to the diseases of concern. For each category, the details of the research objectives, the adopted CNN models, and the obtained main results have been presented. The research challenges have also been discussed to provide a reference for future work on CNN-based nuclear medicine image classification.

Acknowledgments
This work was partially supported by the National Natural Science