Recurrent and convolutional neural networks for sequential multispectral optoacoustic tomography (MSOT) imaging

Multispectral optoacoustic tomography (MSOT) is a valuable technique for diagnosing and analyzing biological samples since it provides detailed anatomical and physiological information. However, acquiring volumetric MSOT data with high through-plane resolution is time-consuming. Here, we propose a deep learning model based on hybrid recurrent and convolutional neural networks to generate sequential cross-sectional images for an MSOT system. This system provides three modalities (MSOT, ultrasound, and optoacoustic imaging of a specific exogenous contrast agent) in a single scan. This study used ICG-conjugated nanoworm particles (NWs-ICG) as the contrast agent. Instead of acquiring seven images with a step size of 0.1 mm, we acquire only two images with a step size of 0.6 mm as input for the proposed deep learning model, which then generates the five intermediate images at 0.1 mm steps, reducing acquisition time by approximately 71%.


| INTRODUCTION
Multispectral optoacoustic tomography (MSOT) is an in vivo optical imaging modality for molecular, anatomical, and functional imaging [1, 2]. The principle of MSOT is based on the optoacoustic effect: a molecule is excited by an ultrashort laser pulse, which can penetrate several centimeters through tissue [3, 4], resulting in a thermoelastic expansion surrounding the molecule that generates a photoacoustic wave [5]. An ultrasound transducer is then used to detect this wave as an ultrasound signal. The difference in absorption contrast of tissue in single-wavelength images is employed to reconstruct anatomical images. Using multiple wavelengths to excite the tissue, we can obtain multispectral images from intrinsic and extrinsic signals. Laser excitation between 680 and 980 nm is the predominant source for intrinsic signals such as deoxygenated hemoglobin, oxygenated hemoglobin, melanin, myoglobin, bilirubin, fat, and so forth. Extrinsic signals do not usually occur in cells, tissue, or animals. Agents that absorb in the near-infrared range, such as indocyanine green (ICG), fluorescent proteins, and nanoparticles, can increase the optoacoustic signal (extrinsic signal). Thus, they can be distinguished from intrinsic tissue background signals by using effective spectral unmixing algorithms such as linear regression, guided independent component analysis (ICA), and principal component analysis [6, 7]. MSOT is widely used in several fields such as cancer research [8-12], drug development [13, 14], and nanoparticle research [15-18]. However, using multiwavelength excitation to scan the sample is time-consuming, especially for cross-sectional scanning in 3D image reconstruction, since imaging requires sweeping through all wavelengths at every scanning position. For in vivo experiments, this can lead to image degradation from motion artifacts and potential lethality from prolonged anesthesia. In recent years, deep learning-based approaches have played a vital role in optoacoustic imaging and have been widely used in applications such as image classification and segmentation [19-23], quantitative photoacoustic imaging [24-28], and image enhancement [29-33]. One main advantage of deep learning for these applications is that it depends less on hardware modifications. In addition, most of these deep learning techniques were designed to use a single 2D image as input and apply convolutional architectures for feature extraction. For instance, deep learning for automatic segmentation of optoacoustic ultrasound images [34] used the U-Net architecture [35] to perform image segmentation. U-Net is a well-known convolutional neural network (CNN) for image segmentation, particularly of biomedical images [36-39].
Nevertheless, there are no deep learning-based techniques to reduce the acquisition time of cross-sectional scanning for 3D optoacoustic imaging. Herein, we propose a hybrid architecture of a CNN and a recurrent neural network (RNN) for generating sequential optoacoustic, unmixed optoacoustic (of a specific contrast agent), and ultrasound images to extend the stack of cross-sectional images and reduce acquisition time by approximately 71%. This hybrid architecture is called the inception generator long short-term memory (I-Gen-LSTM) model. The I-Gen is a CNN designed on the inception U-Net architecture. Inception is a convolutional layer [40] that convolves the input in parallel with different kernel sizes, extracting more features than a simple convolutional layer. An RNN is a robust and effective approach for sequential problems. It is a feed-forward neural network with internal memory that performs the same function for every data input, with the output for the current input depending on the previous output. However, the original RNN suffers from exploding and vanishing gradients during backpropagation, particularly for long sequential inputs. LSTM networks [41] are improved RNNs capable of learning long-term dependencies by adding a forget gate, an input gate, and an output gate. Therefore, we leverage the I-Gen and LSTM networks to generate sequential images. Our results demonstrate that the I-Gen-LSTM model is a versatile method that can generate not only sequential optoacoustic images but also sequential unmixed optoacoustic and ultrasound images.

| EXPERIMENTAL

| Data acquisition

A commercial MSOT system (inVision 512-echo, iThera Medical GmbH, Munich, Germany) was used to acquire the data for training the I-Gen-LSTM model. The MSOT system has a 270° tomographic ultrasound transducer array, which can acquire signals from multiple angles around an object. This tomographic array enables the system to image complex shapes since it can capture two-dimensional signals in the imaging plane. Figure 1A shows the detection and illumination geometry in the imaging chamber of the MSOT system. In addition, this system provides a tunable laser with a range of 660-1300 nm, which is particularly suitable for most biological samples. The excitation pulse laser is used to illuminate the sample. The sample absorbs this pulse and converts it to heat, which results in a transient thermoelastic expansion that generates an acoustic wave. The ultrasound transducer is then used to detect this acoustic wave, and the backprojection algorithm [42] is applied to the detected optoacoustic signal to reconstruct the images. For dataset preparation, transgenic mice [43] with breast tumors were intravenously injected with ICG-conjugated superparamagnetic iron oxide nanoworms (NWs-ICG) [44], which accumulate in tumors longer than pure ICG through the enhanced permeability and retention effect [45]. Twenty-four hours after injection, the mice were euthanized and the tumors were removed and dissected for this study. All procedures performed on animals were approved by the University's Institutional Animal Care & Use Committee and were within the guidelines of humane care of laboratory animals. To acquire images of the tumors, 4 mg of agarose powder was dissolved in 40 mL of warm deionized water. The breast tumor was placed in this agarose solution, allowing approximately 15 min for the solution to solidify. The hardened agarose with the tumor inside, shown in Figure 1B, was grasped by the holder and then scanned by the inVision MSOT system with the excitation pulse at wavelengths from 800 to 1000 nm (a comprehensive range for the NWs-ICG study). Since the inVision MSOT system can provide corresponding ultrasound images, NWs-ICG optoacoustic images obtained through a linear spectral unmixing algorithm [46], and single-wavelength optoacoustic images, these three imaging modalities were simultaneously acquired at every scanning position. Figure 1D1-D4 shows the ultrasound images of the breast tumors at different scanning positions; Figure 1E1-E4 shows the corresponding NWs-ICG optoacoustic images reconstructed from multispectral optoacoustic imaging with the excitation pulse at wavelengths from 800 to 1000 nm by using the spectral unmixing algorithm; Figure 1F1-F4 shows the corresponding single-wavelength optoacoustic images at 800 nm excitation; and Figure 1G1-G4 shows the corresponding overlaid images of these three modalities.
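The backprojection step can be summarized with a minimal delay-and-sum sketch. The snippet below is an illustrative reconstruction under simplifying assumptions (uniform speed of sound, ideal point detectors); the array geometry, sampling rate, and variable names are ours, not those of the vendor software.

```python
import numpy as np

def delay_and_sum_backprojection(signals, sensor_xy, grid_xy, fs, c=1500.0):
    """Minimal 2D delay-and-sum backprojection sketch.

    signals   : (n_sensors, n_samples) detected optoacoustic time series
    sensor_xy : (n_sensors, 2) sensor positions in meters
    grid_xy   : (n_pixels, 2) image pixel positions in meters
    fs        : sampling rate in Hz
    c         : assumed uniform speed of sound in m/s
    """
    n_sensors, n_samples = signals.shape
    image = np.zeros(len(grid_xy))
    for s in range(n_sensors):
        # Time of flight from every pixel to this sensor.
        dist = np.linalg.norm(grid_xy - sensor_xy[s], axis=1)
        idx = np.round(dist / c * fs).astype(int)
        valid = idx < n_samples
        # Sum each sensor's signal at the corresponding delay.
        image[valid] += signals[s, idx[valid]]
    return image / n_sensors
```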

| I-Gen-LSTM and discriminator models
The I-Gen-LSTM model comprises three main neural networks, depicted in Figure 2A-C. The first neural network is the inception encoder and decoder network based on the inception U-Net architecture. The original U-Net architecture employs simple convolution blocks with skip connections between encoders and decoders of the same dimension, helping the model circumvent the vanishing and exploding gradient problems. However, simple convolution blocks may be insufficient to comprehensively extract all crucial information. The inception architecture is an effective CNN architecture since it applies a wide range of kernel sizes to extract global and local features: large and small kernel sizes are tailored to extract globally and locally distributed information, respectively. With this attribute, the encoder and decoder network was designed using inception U-Net as its backbone, as shown in Figure 2A, to improve the model's capability. This network takes two 2D images, acquired from arbitrary consecutive positions with a step size of 0.6 mm, as its inputs (input 1 and input 2, as shown in Figure 2A). The encoder shown on the left side of Figure 2A generates encoder outputs (E1n-E5n, where n is the input image number, i.e., 1 and 2). The inception architecture in the encoder, with three different kernel sizes (1 × 1, 3 × 3, and 5 × 5) assembled as parallel filters, is used to extract features from the tensors, followed by a rectified linear unit (ReLU) and a 2 × 2 max pooling with a stride of two for downsampling. Similarly, the inception architecture is also used in the decoder blocks. These decoder blocks generate the decoder outputs (D1n-D5n, where n is the input image number, i.e., 1 and 2) shown on the right side of Figure 2A, each followed by a feature map upsampling, a 2 × 2 up-convolution (halving the number of feature channels), and concatenation with the corresponding encoder output.
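As an illustration, the following Keras sketch shows one inception-style encoder block of the kind described above, with 1 × 1, 3 × 3, and 5 × 5 parallel convolutions followed by ReLU and 2 × 2 max pooling. Filter counts, layer names, and the input size are our own assumptions; the paper does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_encoder_block(x, filters):
    """One inception-style encoder block: parallel 1x1, 3x3, and 5x5
    convolutions concatenated, then ReLU and 2x2 max pooling (stride 2).
    The filter count per branch is an assumption for illustration."""
    b1 = layers.Conv2D(filters, 1, padding="same")(x)
    b3 = layers.Conv2D(filters, 3, padding="same")(x)
    b5 = layers.Conv2D(filters, 5, padding="same")(x)
    merged = layers.Concatenate()([b1, b3, b5])
    activated = layers.ReLU()(merged)  # skip-connection source (E-level output)
    pooled = layers.MaxPooling2D(pool_size=2, strides=2)(activated)
    return activated, pooled

# Minimal usage: each of the two input images passes through such blocks,
# as in Figure 2A. The 256 x 256 image size is an assumption.
inp = layers.Input(shape=(256, 256, 1))
e1, x = inception_encoder_block(inp, 32)
```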
The second neural network is the convolutional LSTM network (ConvLSTM) [47], an RNN for spatiotemporal prediction. It has a convolutional structure in both the input-to-state and state-to-state transitions, as shown in the bottom right of Figure 2B. In other words, the internal matrix multiplications are replaced with convolution operations. Consequently, the data flowing through the ConvLSTM cells keep the input's spatial dimensions instead of being flattened to a 1D feature vector. The main equations of ConvLSTM are expressed in Equations (1)-(5) below, where "$*$" and "$\circ$" represent the convolution operator and the Hadamard product (element-wise matrix multiplication), respectively:

$$i_t = \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right) \quad (1)$$

$$f_t = \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right) \quad (2)$$

$$C_t = f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \quad (3)$$

$$o_t = \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right) \quad (4)$$

$$H_t = o_t \circ \tanh(C_t) \quad (5)$$

Here, $X_t$ is the input, $C_t$ the cell state, $H_t$ the hidden state, and $i_t$, $f_t$, and $o_t$ the input, forget, and output gates at step $t$. All variables in Equations (1)-(5) are shown in the "ConvLSTM block" in Figure 2B.
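In Keras, this spatiotemporal recurrence is available as a built-in layer. A minimal sketch of a ConvLSTM stage consuming a short sequence of feature maps follows; the shapes and filter counts are illustrative assumptions, not the paper's configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A sequence of 2 encoded frames (the two input slices), each a 32x32
# feature map with 64 channels; batch dimension left unspecified.
seq_in = layers.Input(shape=(2, 32, 32, 64))

# ConvLSTM2D applies convolutions in the input-to-state and
# state-to-state transitions (Equations 1-5). return_sequences=True
# keeps one spatial output per time step.
recurrent = layers.ConvLSTM2D(
    filters=64, kernel_size=3, padding="same", return_sequences=True
)(seq_in)

model = tf.keras.Model(seq_in, recurrent)
model.summary()
```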
Lastly, the sequential image generator network is inspired by the U-Net architecture. The model takes Recurrent Conv 1-5, the two input images, the encoder outputs (E11-E51 and E12-E52), and the decoder outputs (D11-D41 and D12-D42) to reconstruct five sequential images at different scanning positions, as shown in Figure 2C. The left side of Figure 2C shows the concatenated encoder and decoder outputs generated by the inception encoder and decoder (Figure 2A). The right side of Figure 2C shows the Conv2D transpose and Conv2D operations applied to the Recurrent Conv 1-5 blocks generated by the ConvLSTM (Figure 2B) and to the concatenated encoder and decoder outputs. All Conv2D transpose and Conv2D blocks use ReLU as their activation function except the last Conv2D*, which applies the hyperbolic tangent (tanh). The Recurrent Conv blocks regulate the gradual change across the sequential output images. In short, the I-Gen-LSTM model takes two images acquired at consecutive positions with a 0.6 mm step size and generates the five sequential images between them, changing gradually with the scanning position (step sizes of 0.1-0.5 mm). The ground truth images acquired with a small step size (0.1-0.5 mm) were used to compute the loss on these five generated images (GEN). The loss functions are elucidated in Section 2.3. The discriminator network shown in Figure 2D is a simple convolutional network designed to evaluate the similarity between the ground truths (GT) and GEN. The model comprises eight convolutional layers and two fully connected layers. Each convolution block is followed by a batch normalization layer and a Leaky ReLU activation function (α = 0.2). The number of 3 × 3 filter kernels increases by a factor of 2 from 64 (the first layer) to 512 (the eighth layer). The last two layers are dense layers working as a classification block, predicting the probability of an image being real or fake. To train the I-Gen-LSTM model, we assemble the models as a generative adversarial network (GAN) [48], as shown in Figure 3.
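A compact Keras sketch of such a discriminator follows. It mirrors the description above (eight 3 × 3 convolutional layers with filters rising from 64 to 512, batch normalization, LeakyReLU with α = 0.2, and two dense layers); the stride pattern, exact per-layer filter schedule, dense width, and input size are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(input_shape=(256, 256, 1)):
    """Discriminator sketch: 8 conv layers with 3x3 kernels and filter
    counts growing from 64 to 512, each followed by batch normalization
    and LeakyReLU(0.2), then two dense layers predicting real vs. fake.
    Strides alternating 1/2 are an assumption to shrink feature maps."""
    inp = layers.Input(shape=input_shape)
    x = inp
    for i, filters in enumerate([64, 64, 128, 128, 256, 256, 512, 512]):
        x = layers.Conv2D(filters, 3, strides=1 + i % 2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024)(x)
    x = layers.LeakyReLU(0.2)(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # probability real/fake
    return tf.keras.Model(inp, out, name="discriminator")
```

In the GAN assembly, this network scores both the ground truth slices and the generator's outputs, and its probability output feeds the adversarial loss described in the next section.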

| Loss functions
To optimize the I-Gen-LSTM model, we designed a custom-made loss function combining the content loss (VGG19 loss, $l^{SS}_{VGG}$) [49], the adversarial loss (discriminator loss, $l^{SS}_{Gen}$), and the neighbor loss ($l^{SS}_{N}$), as shown in Equation (6):

$$l^{SS} = C_{w1}\, l^{SS}_{VGG} + C_{w2}\, l^{SS}_{Gen} + C_{w3}\, l^{SS}_{N} \quad (6)$$

where $C_{w1}$, $C_{w2}$, and $C_{w3}$ are hyperparameters set to 0.7, 0.1, and 0.2, respectively.
The content loss or VGG loss ($l^{SS}_{VGG}$), defined as the Euclidean distance between the feature maps of the generated image $G_{\theta_G}(I^{LS})$ and of the ground truth $I^{SS}$, extracts high-dimensional features, helping the model generate images with perceptually satisfying solutions and without excessively smooth textures. The $l^{SS}_{VGG}$ loss is based on the ReLU activation layers of the pretrained 19-layer VGG network and is calculated following Equation (7):

$$l^{SS}_{VGG/i,j} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}\left(I^{SS}\right)_{x,y} - \phi_{i,j}\left(G_{\theta_G}\left(I^{LS}\right)\right)_{x,y} \right)^2 \quad (7)$$

where $W_{i,j}$ and $H_{i,j}$ describe the dimensions of the respective feature maps within the VGG network, and the feature map $\phi_{i,j}$ is obtained from the $j$th convolution before the $i$th max pooling layer within the VGG19 network. Moreover, the adversarial loss ($l^{SS}_{Gen}$) is also employed to distinguish the similarity of the two images. It is defined from the probabilities, varying from 0 to 1, output by the discriminator model $D_{\theta_D}$, as shown in Equation (8):

$$l^{SS}_{Gen} = -\log D_{\theta_D}\left(G_{\theta_G}\left(I^{LS}\right)\right) \quad (8)$$

where $I^{LS}$ denotes the input images, $G_{\theta_G}$ is the generator model, and $D_{\theta_D}$ is the discriminator model. Apart from the content and adversarial losses, the neighbor loss is also applied to optimize the model. Since the I-Gen-LSTM model generates sequential images, the neighbor loss is essential to regulate the change of each generated image in the sequence. The concept of the neighbor loss is to penalize the distance between the current generated image ($I_n$) and its neighbor images ($I_{n-1}$ and $I_{n+1}$) in the same sequence, as expressed in Equation (9):

$$l^{SS}_{N} = \left\| I_n - I_{n-1} \right\| + \left\| I_n - I_{n+1} \right\| \quad (9)$$

The custom-made loss function effectively leverages the combination of these three losses to train the I-Gen-LSTM model to generate high-quality sequential images.
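A minimal sketch of this combined objective in Keras follows. Only the three-term structure and the 0.7/0.1/0.2 weights come from the text; the VGG feature layer choice, the norm used in the neighbor term, the omission of VGG preprocessing, and all function names are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG19

# Pretrained VGG19 truncated at an intermediate ReLU feature map; the
# specific layer ("block5_conv4") is an assumption, not from the paper.
vgg = VGG19(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)
feature_extractor.trainable = False

def vgg_features(img):
    # VGG19 expects 3 channels; single-channel slices are tiled to RGB.
    # (VGG input preprocessing is omitted for brevity.)
    return feature_extractor(tf.image.grayscale_to_rgb(img))

def content_loss(gt, gen):
    # Euclidean distance between VGG19 feature maps (Equation 7).
    return tf.reduce_mean(tf.square(vgg_features(gt) - vgg_features(gen)))

def adversarial_loss(disc_prob_on_gen):
    # -log D(G(I)) on the discriminator's probability output (Equation 8).
    return -tf.reduce_mean(tf.math.log(disc_prob_on_gen + 1e-8))

def neighbor_loss(gen_seq):
    # Penalize the distance between consecutive generated frames
    # (Equation 9); gen_seq has shape (batch, 5, H, W, 1).
    return tf.reduce_mean(tf.abs(gen_seq[:, 1:] - gen_seq[:, :-1]))

def total_loss(gt_frame, gen_frame, disc_prob, gen_seq, cw=(0.7, 0.1, 0.2)):
    # Weighted combination of the three terms (Equation 6).
    return (cw[0] * content_loss(gt_frame, gen_frame)
            + cw[1] * adversarial_loss(disc_prob)
            + cw[2] * neighbor_loss(gen_seq))
```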

| I-Gen-LSTM model for volumetric imaging
To collect the database for training the model, 16 breast tumors from mice intravenously injected with NWs-ICG were imaged with the MSOT system. The data from these tumors were allocated to training (11 tumors), validation (3 tumors), and testing (2 tumors) datasets. The training time on Google Colaboratory (Colab) Pro was approximately 40 h. After the model is initialized and loaded, the I-Gen-LSTM can generate the five sequential output images in less than 1 s on a personal computer with an 11th Gen Intel Core i7-11700K CPU, 16 GB RAM, and an NVIDIA RTX 3090 graphics card.
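For illustration, inference might look like the following sketch; the file names, preprocessing, and tensor layout are hypothetical, as the paper does not publish its interface.

```python
import numpy as np
import tensorflow as tf

# Hypothetical saved-model path and input layout: two slices acquired
# 0.6 mm apart, assumed normalized to [-1, 1] to match a tanh output.
model = tf.keras.models.load_model("i_gen_lstm.keras", compile=False)

slice_a = np.load("slice_z9.7mm.npy")   # placeholder file names
slice_b = np.load("slice_z10.3mm.npy")
pair = np.stack([slice_a, slice_b])[None, ..., None]  # (1, 2, H, W, 1)

# The model returns the five intermediate slices at 0.1 mm steps.
generated = model.predict(pair)          # (1, 5, H, W, 1)
print(generated.shape)
```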

| Sequential NWs-ICG optoacoustic, ultrasound, and optoacoustic (λ ex = 800 nm) image reconstruction
A breast tumor dissected from an NWs-ICG-injected mouse was scanned with the MSOT system. Figure 4 shows the sequential images generated by the I-Gen-LSTM model. Two input images of each modality, acquired from consecutive stage positions with a step size of 0.6 mm, are used as the inputs for the I-Gen-LSTM model.
Here, we demonstrate a z-scanning range from 9.7 to 10.3 mm with a step size of 0.1 mm as a representative example. The red-dashed boxes in Figure 4 show local features that change gradually along the z-scanning position and are relatively straightforward to observe. The orange-dashed boxes are the corresponding enlarged images of the red-dashed boxes. Figure 4A shows the sequential image reconstruction result of NWs-ICG optoacoustic imaging, Figure 4B shows the result of ultrasound imaging, and Figure 4C shows the result of single-wavelength optoacoustic (λ ex = 800 nm) imaging. The average peak signal-to-noise ratio (PSNR)/average summation of absolute errors (SAE) between the GT and GEN over this scanning range are 87.72 dB/923.66 for NWs-ICG optoacoustic, 78.83 dB/4323.19 for ultrasound, and 75.60 dB/2223.40 for optoacoustic (λ ex = 800 nm) imaging, respectively.
| Three-dimensional reconstruction of the stacked 2D NWs-ICG optoacoustic, ultrasound, and optoacoustic (λ ex = 800 nm) images

Since the MSOT system and our deep learning model provide stacks of multiple cross-sectional images for the NWs-ICG optoacoustic, ultrasound, and optoacoustic (λ ex = 800 nm) modalities, we can use these images to reconstruct three-dimensional (3D) images using the Amira software (Mercury Computer Systems, Berlin, Germany). Figure 5 shows the 3D reconstruction results of the GT and the GEN. Figure 5A demonstrates the 3D reconstruction of the GEN from the I-Gen-LSTM model and Figure 5B shows the reconstruction of the GT acquired by mechanical scanning. After the experiment, the tumor was removed from the agarose and sent to the histopathology lab (MSU-IHPL Research Facility) to prepare the hematoxylin-and-eosin (H&E) stained breast tumor slide shown in Figure 5C.

| Evaluations
The NWs-ICG optoacoustic, ultrasound, and optoacoustic (λ ex = 800 nm) images from the two tumors not used for training were utilized for model evaluation. Each tumor was scanned with a step size of 0.1 mm. Every image pair with a 0.6 mm scanning step in between was assigned as the input for the I-Gen-LSTM model to generate five sequential images with a step size of 0.1 mm. The model was evaluated using four quantitative metrics: the average PSNR, SAE(GEN, GT), SAE(Input 1, GT), and SAE(Input 2, GT), applied to the testing dataset for all scanning positions.
A large PSNR and a small SAE(GEN, GT) imply high-quality GEN. Moreover, if SAE(GEN, GT) outperforms SAE(Input 1, GT) and SAE(Input 2, GT), the model is effectively generating the intermediate images rather than merely copying its inputs. The average evaluation metrics are calculated following Equations (10)-(12):

$$\text{Average PSNR} = \frac{1}{5N}\sum_{n=1}^{N}\sum_{i=1}^{5}\text{PSNR}\left(\text{GEN}_{n,i},\,\text{GT}_{n,i}\right) \quad (10)$$

$$\text{Average SAE}(\text{GEN},\text{GT}) = \frac{1}{5N}\sum_{n=1}^{N}\sum_{i=1}^{5}\sum_{x,y}\left|\text{GEN}_{n,i}(x,y)-\text{GT}_{n,i}(x,y)\right| \quad (11)$$

$$\text{Average SAE}(\text{Input}_k,\text{GT}) = \frac{1}{5N}\sum_{n=1}^{N}\sum_{i=1}^{5}\sum_{x,y}\left|\text{Input}_{n,k}(x,y)-\text{GT}_{n,i}(x,y)\right| \quad (12)$$

where N is the number of scanning positions with a step size of 0.6 mm, GEN is the generated image at the ith position between the two input images, GT is the corresponding ground truth, and Input k (k = 1 and 2) are the two input images acquired from arbitrary consecutive positions with a step of 0.6 mm. Figure 6 shows the representative result from one of the evaluated tumors as graphs of the average PSNR and SAE(GEN, GT) versus scanning position. Table 1 shows the average evaluation metrics of the generated sequential NWs-ICG optoacoustic, ultrasound, and optoacoustic (λ ex = 800 nm) images over all testing datasets. Overall, the average PSNR and SAE between GEN and GT for all modalities are greater than 75 dB and less than 2000, respectively.
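A sketch of these metrics in NumPy follows, under our own naming and with the standard PSNR definition; the paper does not state its peak value, so we assume the ground truth's maximum.

```python
import numpy as np

def psnr(gt, gen, peak=None):
    """PSNR in dB; 'peak' defaults to the ground truth's maximum,
    which is an assumption about the paper's convention."""
    peak = gt.max() if peak is None else peak
    mse = np.mean((gt.astype(np.float64) - gen.astype(np.float64)) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def sae(a, b):
    """Summation of absolute errors between two images (Equations 11-12)."""
    return np.abs(a.astype(np.float64) - b.astype(np.float64)).sum()

def evaluate_stack(gen_stacks, gt_stacks, input_pairs):
    """Average metrics over N input pairs, each with 5 generated slices
    (Equations 10-12). gen_stacks/gt_stacks: (N, 5, H, W);
    input_pairs: (N, 2, H, W)."""
    psnrs, sae_gen, sae_in = [], [], [[], []]
    for gen5, gt5, pair in zip(gen_stacks, gt_stacks, input_pairs):
        for i in range(5):
            psnrs.append(psnr(gt5[i], gen5[i]))
            sae_gen.append(sae(gen5[i], gt5[i]))
            for k in (0, 1):
                sae_in[k].append(sae(pair[k], gt5[i]))
    return (np.mean(psnrs), np.mean(sae_gen),
            np.mean(sae_in[0]), np.mean(sae_in[1]))
```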
This indicates that the I-Gen-LSTM model can generate sequential images with promising results. To evaluate the model performance comprehensively, we also compared SAE(GEN, GT) against SAE(Input 1, GT) and SAE(Input 2, GT) as baselines, as shown in Table 1 (bold values are the lowest SAE values for each modality). The average SAE(GEN, GT) of optoacoustic (λ ex = 800 nm) and ultrasound imaging outperforms the average SAE(Input 1, GT) and SAE(Input 2, GT), but NWs-ICG optoacoustic imaging does not (its average SAE(GEN, GT) is slightly higher than the averages of SAE(Input 1, GT) and SAE(Input 2, GT)) because of the subtly changing features in the sequential NWs-ICG optoacoustic images and the limited size of the training dataset. Although the overall result is favorable and encouraging, the deep learning model could be improved in future work: we will train with a larger dataset and a larger image size so that the convolution/LSTM blocks can capture more sequential features, especially in a modality with subtly changing features such as NWs-ICG optoacoustic imaging.

This work demonstrates a deep learning technique based on RNN and CNN architectures for generating sequential NWs-ICG optoacoustic (multispectral unmixing), ultrasound, and optoacoustic images. It has shown robust and promising performance in the accurate reconstruction of the sequential images for all modalities, according to the quantitative evaluation using the PSNR and SAE across all scanning positions of the GEN (reconstructed by the deep learning model) and the ground truth (acquired by mechanical scanning). The architecture of our model is versatile since it can generate sequential cross-sectional images of three modalities from a commercial MSOT system, and using our deep learning model can substantially reduce acquisition time. However, all the training data were acquired from ex vivo tissues completely fixed in agarose; model performance with images acquired in vivo may be affected by cardiac and respiratory motion. In the future, we will explore optimizing and applying the model to generate sequential images of in vivo samples with motion artifacts.
TABLE 1 Average quantitative metrics of optoacoustic (λ ex = 800 nm), NWs-ICG optoacoustic, and ultrasound images generated by the proposed deep learning model.

FIGURE 1 Ultrasound, NWs-ICG optoacoustic (obtained through multispectral unmixing), and optoacoustic (800 nm excitation) imaging of an ex vivo breast tumor from a mouse intravenously injected with NWs-ICG. (A) The detection and illumination geometry in the imaging chamber of the MSOT system. (B) The breast tumor embedded in agarose. (C) NWs-ICG structure. (D1)-(D4) Ultrasound images of the breast tumor at different scanning positions. (E1)-(E4) The corresponding NWs-ICG optoacoustic images obtained through multispectral unmixing. (F1)-(F4) The corresponding single-wavelength (λ ex = 800 nm) optoacoustic images. (G1)-(G4) The overlay of the ultrasound, NWs-ICG optoacoustic (colormap), and single-wavelength optoacoustic images.

FIGURE 2 I-Gen-LSTM and discriminator architectures. (A) Inception encoder and decoder networks applied to both images (input 1 and input 2). (B) ConvLSTM network for generating the sequential blocks (Recurrent Conv 1-5) fed to the sequential image generator network for reconstructing the sequential output images. (C) The sequential image generator network. (D) The discriminator network.

FIGURE 4 Results of sequential image reconstruction generated by the I-Gen-LSTM model. The two input images for each modality, simultaneously acquired with a step size of 0.6 mm, were fed into the I-Gen-LSTM model. The green, blue, and violet boxes show the generated images (GEN), ground truth (GT), and the absolute error between the GEN and GT images (|GT - GEN|) represented as color-map images. The red-dashed boxes show local features that change gradually along the z-scanning position, and the yellow-dashed boxes are the corresponding enlarged images of the red-dashed boxes. The scale bar is 5 mm. (A) NWs-ICG optoacoustic sequential image reconstruction result. (B) Ultrasound sequential image reconstruction result. (C) Single-wavelength optoacoustic (λ ex = 800 nm) reconstruction result. Comparison results between the ground truths and generated images (outputs of the I-Gen-LSTM model) of NWs-ICG optoacoustic, ultrasound, and optoacoustic (λ ex = 800 nm) imaging are also shown in Videos S1-S3, respectively.

FIGURE 4 (Continued)

FIGURE 5 3D image reconstruction of the breast tumor using stacked cross-sectional NWs-ICG optoacoustic, ultrasound, and optoacoustic (λ ex = 800 nm) images. (A) The 3D reconstruction result of the NWs-ICG optoacoustic, ultrasound, and optoacoustic (λ ex = 800 nm) images generated by the I-Gen-LSTM model with a step size of 0.1 mm. (B) The 3D reconstruction result acquired by mechanical scanning with a step size of 0.1 mm. (C) The photograph of the corresponding tumor and its H&E slide image.

FIGURE 6 The PSNR and SAE(GEN, GT) evaluation in one of the testing tumors. (A), (B) Graphs of the PSNR and SAE(GEN, GT) values versus scanning position for all generated optoacoustic (λ ex = 800 nm), NWs-ICG optoacoustic, and ultrasound images.
Note: The bold values in Table 1 show the lowest SAE among SAE(GEN, GT), SAE(Input 1, GT), and SAE(Input 2, GT). If the SAE(GEN, GT) value is less than both SAE(Input 1, GT) and SAE(Input 2, GT), the model performance is exceptional.