Deep Learning Enables Superior Photoacoustic Imaging at Ultralow Laser Dosages

Abstract Optical-resolution photoacoustic microscopy (OR-PAM) is an excellent modality for in vivo biomedical imaging, as it noninvasively provides high-resolution morphologic and functional information without the need for exogenous contrast agents. However, the high excitation laser dosage, limited imaging speed, and imperfect image quality still hinder the use of OR-PAM in clinical applications. Laser dosage, imaging speed, and image quality mutually constrain one another, and thus far, no method has been proposed to resolve this challenge. Here, a deep learning method called the multitask residual dense network (MT-RDN) is proposed to overcome this challenge. The method employs an innovative strategy that integrates multisupervised learning, dual-channel sample collection, and a reasonable weight distribution. The proposed deep learning method is combined with an application-targeted modified OR-PAM system. Superior images under ultralow laser dosage (32-fold reduced dosage) are obtained for the first time in this study. Using this new technique, a high-quality, high-speed OR-PAM system that meets clinical requirements is now conceivable.


Supplementary information on the OR-PAM system and experimental procedures
As shown in Figure S1 (a), the output beam from the multi-mode fiber was delivered to the imaging head, where it was collimated by an inverted objective and then focused by an identical objective to provide focused optical illumination of the tissue below the water dish.
As shown in Figure S1 (b), the spatial resolution, measured with a sharp metallic blade, is ~19.8 μm. For in vivo imaging, the pixel step size is 10 μm. Both laser sources were operated at a 1 kHz repetition rate. The fully sampled data were acquired with a laser energy of ~800 nJ per pulse reaching the biological tissue (accordingly, the 1/2, 1/3, and 1/4 ANSI-limit per-pulse laser energies in this study correspond to 400, 267, and 200 nJ per pulse, respectively), corresponding to an optical fluence of ~13.33 mJ/cm² on the skin surface (when focused ~200 μm below the skin surface), which conforms to the ANSI safety standards. Medical-grade ultrasound gel was evenly dispensed between the animal and the water tank for acoustic coupling.
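For reference, the short back-of-envelope check below reproduces the quoted surface fluence from the stated pulse energy; the ~87 μm surface beam diameter is an assumption inferred from those two numbers (the focus lies ~200 μm below the surface, so the beam at the skin is wider than the focal spot), not a value reported above.

```python
import math

# Check the quoted surface fluence (~13.33 mJ/cm^2) for an 800 nJ pulse.
pulse_energy_j = 800e-9        # 800 nJ per pulse reaching the tissue
beam_diameter_cm = 87e-4       # ~87 um beam diameter on the skin (assumed)
area_cm2 = math.pi * (beam_diameter_cm / 2) ** 2
fluence_mj_cm2 = pulse_energy_j / area_cm2 * 1e3
print(f"fluence ~= {fluence_mj_cm2:.2f} mJ/cm^2")
# ~13.46 mJ/cm^2, below the 20 mJ/cm^2 ANSI skin limit for visible ns pulses
```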
Photoacoustic waves emitted from the tissue propagated through the water and the Combiner (mainly comprising a rhomboidal prism and a triangular prism), were totally reflected at the silicone oil layer filled between the rhomboidal and triangular prisms of the Combiner, and were then detected by a 50-MHz ultrasonic transducer (V214-BC-RM, Olympus-NDT, Japan) placed on the top surface of the Combiner. The detected signals were pre-amplified using a commercial electrical amplifier (ZFL500LN-BNC, Mini-Circuits), digitized via a 200-MS/s data acquisition (DAQ) card (CS1422, GaGe), and then stored on a personal computer. A 6-mm-diameter acoustic lens (45006, Edmund; acoustic NA: 0.43) was firmly attached to the bottom of the Combiner to enhance the receiving sensitivity by confocally aligning the optical and acoustic foci. Volumetric OR-PAM images were obtained by mechanical raster scanning with precisely programmed electrical scanners (PLS-85, Micos). The imaging speed of the OR-PAM system, limited by the speed of mechanical scanning, was ~0.5 frames (B-scans) per second.
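Although the downstream signal processing is not detailed above, the sketch below illustrates a typical OR-PAM processing chain (Hilbert-transform envelope detection per A-line, followed by maximum-amplitude projection and time-of-flight depth encoding); the array sizes and the speed of sound are assumptions, not values from the study.

```python
import numpy as np
from scipy.signal import hilbert

fs = 200e6                             # 200 MS/s DAQ sampling rate
c = 1540.0                             # speed of sound in tissue, m/s (assumed)
rf = np.random.randn(64, 64, 512)      # dummy raster-scan volume: (x, y, time)
env = np.abs(hilbert(rf, axis=-1))     # envelope of each A-line
map_img = env.max(axis=-1)             # 2D maximum-amplitude-projection image
# Relative depth from time of flight (one-way acoustic travel in PA imaging)
depth_um = env.argmax(axis=-1) / fs * c * 1e6
```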

Details of the MT-RDN convolutional neural network architecture
The proposed MT-RDN network is divided into three subnetworks, as shown in Figure S2.
Each subnetwork employs an independent RDN framework [1] and is assigned its own supervised-learning task. The three subnetworks are connected through weight distribution. First, Inputs 1 and 2 are fed to Subnets 1 and 2 to reconstruct Recons 1 and 2, respectively. Recons 1 and 2 are then summed according to the ratio λ and used as the input to Subnet 3 to reconstruct Recon 3. The red dashed boxes in the three subnetworks in Figure S2 are residual dense blocks (RDBs), and the detailed operation of each RDB is described in the expanded red dashed box on the right side of Figure S2. The forward process of Subnets 1 and 2 (n = 1, 2) is described below. First, shallow features are extracted from Input n by convolutional layers. Second, the shallow features are up-sampled. Third, the up-sampled features go through D (D = 4 in this study) RDBs for local feature fusion; the output $F_d$ of the d-th RDB is expressed as

$$F_d = H_{\mathrm{RDB},d}(F_{d-1}),$$

where $H_{\mathrm{RDB},d}$ denotes the operations of the d-th RDB. Fourth, the residual dense features from the D RDBs are merged via global feature fusion (concatenation followed by a 1×1 convolution):

$$F_{GF} = H_{\mathrm{GFF}}([F_1, \ldots, F_D]).$$

Finally, we stack an up-scaling layer (in HR space) to obtain the finer-resolution output. In this way, we obtain the reconstructions $I_{HR}^{n}$, n = 1, 2, of the two channels.
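For concreteness, the following is a minimal PyTorch sketch of one RDB as described above (dense connections, local feature fusion via a 1×1 convolution, and a local residual connection, following the RDN design of [1]); the channel count, growth rate, and number of inner convolutions are placeholder assumptions, since the exact layer settings are listed in Table S1.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """Residual dense block: densely connected conv layers, local
    feature fusion (1x1 conv), and local residual learning."""
    def __init__(self, channels=64, growth=32, num_convs=4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, 3, padding=1),
                nn.ReLU(inplace=True),
            )
            for i in range(num_convs)
        )
        # Local feature fusion: 1x1 conv maps concatenated features back
        # to the block's input channel count.
        self.fuse = nn.Conv2d(channels + num_convs * growth, channels, 1)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(conv(torch.cat(feats, dim=1)))  # dense connections
        return x + self.fuse(torch.cat(feats, dim=1))    # local residual
```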
These two reconstruction results from the two channels (532 nm and 560 nm) are summed in proportion λ and used as the input to Subnetwork 3 (n = 3). The forward process of Subnetwork 3 is similar to that of Subnetworks 1 and 2, except that it contains no up-sampling operation.
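The three-subnetwork arrangement can then be sketched as follows, reusing the RDB class from the previous sketch. This is a simplified stand-in rather than the exact architecture: the (λ, 1 − λ) weighting convention and the omission of RDN's global residual connection are assumptions.

```python
import torch
import torch.nn as nn

class Subnet(nn.Module):
    """Simplified RDN subnetwork: shallow feature extraction, optional
    up-sampling, stacked RDBs (class from the previous sketch), global
    feature fusion, and a final reconstruction convolution."""
    def __init__(self, channels=64, num_rdbs=4, upsample=True):
        super().__init__()
        self.sfe = nn.Conv2d(1, channels, 3, padding=1)
        self.up = (nn.Upsample(scale_factor=2, mode='bilinear',
                               align_corners=False)
                   if upsample else nn.Identity())
        self.rdbs = nn.ModuleList(RDB(channels) for _ in range(num_rdbs))
        self.gff = nn.Conv2d(num_rdbs * channels, channels, 1)  # global fusion
        self.out = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, x):
        f = self.up(self.sfe(x))
        local_feats = []
        for rdb in self.rdbs:
            f = rdb(f)
            local_feats.append(f)
        return self.out(self.gff(torch.cat(local_feats, dim=1)))

class MTRDN(nn.Module):
    def __init__(self, lam=0.5):
        super().__init__()
        self.lam = lam
        self.subnet1 = Subnet(upsample=True)    # 532 nm channel
        self.subnet2 = Subnet(upsample=True)    # 560 nm channel
        self.subnet3 = Subnet(upsample=False)   # fusion subnet, no up-sampling

    def forward(self, in1, in2):
        recon1 = self.subnet1(in1)
        recon2 = self.subnet2(in2)
        mixed = self.lam * recon1 + (1.0 - self.lam) * recon2  # lambda-weighted sum
        recon3 = self.subnet3(mixed)
        return recon1, recon2, recon3
```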
The details of the convolutional layers are shown in Table S1 [2,3]. The mini-batch size was 2.
An exponentially decaying learning rate [4] was used in all CNN-based experiments; the initial learning rate was set to 0.0001 with a decay rate of 0.95. All models were trained using the Adam optimizer [5] with parameters β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸.
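A minimal sketch of these training settings is given below, assuming the decay is applied once per epoch (the decay interval is not stated above); a stand-in model is used so the snippet runs on its own.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 1, 3, padding=1)  # stand-in for the MT-RDN model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(3):
    x = torch.randn(2, 1, 64, 64)                  # mini-batch size 2
    loss = nn.functional.mse_loss(model(x), x)     # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                               # lr *= 0.95 each epoch
```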
The loss function at the end is defined as a weighted combination of the supervised losses of the three subnetworks, one term per reconstruction task, with the weights implementing the weight-distribution strategy described above.
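A hypothetical implementation of such a multitask loss is sketched below; the per-task norm (MSE here) and the weight values are assumptions, since the exact expression is not reproduced in this text.

```python
import torch
import torch.nn.functional as F

def mtrdn_loss(recons, targets, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three supervised reconstruction losses
    (Recons 1-3 vs. Ground truths 1-3); norm and weights assumed."""
    return sum(w * F.mse_loss(r, t)
               for w, r, t in zip(weights, recons, targets))
```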

Detailed results and discussion of the brain and ear data
All images in Figures S3.1 and S3.2 are depth-encoded to reflect the three-dimensional position information, and two areas (white dotted frames) are also enlarged for careful observation. From these figures, it can be seen that the image quality obtained at 1/2 the ANSI limit per pulse laser energy is significantly reduced in the under-sampled images (Figures S3.1 (a) and (d); Figures S3.2 (a) and (d)) compared with the images reconstructed by our method (Figures S3.1 (i) and S3.2 (i)), where high image quality is still maintained.
To further quantify the advantages of the proposed method, signal intensity curves were extracted. The vascular distortion due to image reconstruction is shown in Figure S3.2 (k) for a small selected region. Notably, for evaluating distortion, we deliberately selected blood vessels that are more likely to be distorted after reconstruction, thereby verifying the reconstruction fidelity of the proposed method. The same region is selected across Figures S3.2 (b2), (h2), and (i2), and the signal strengths are plotted in Figure S3.2 (k). The colours of the signal intensity curves in Figure S3.2 (k) correspond to the colours in Figures S3.2 (b2), (h2), and (i2), respectively. Consistent with Figure S3.2 (k), severe vascular distortion can be observed in Figure S3.2 (h2): (1) the two blood vessels shown in Figure S3.2 (b2) appear as three in Figure S3.2 (h2) after filtering; and (2) the blood vessels in Figure S3.2 (h2) are thinner after filtering. In comparison, no vascular distortion occurs in Figure S3.2 (i2) (i.e., two blood vessels in Recon 3), and the size of the blood vessels remains essentially unchanged after reconstruction. Although the SNR of the data reconstructed by our method is slightly lower than that of Ground truth 3, the accuracy of vascular enhancement is much higher, proving its advantage. Two main reasons account for the difference in quality between Input 1 and Input 2 in all images: (1) the excitation wavelengths used for Input 1 (532 nm) and Input 2 (560 nm) are different; and (2) the quality of the excitation laser spot corresponding to Input 2 is inferior because of the OPO laser used. Therefore, if the OPO laser spot quality were improved, the reconstructed images produced by the proposed method would be of even higher quality.
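As an illustration of how such intensity curves can be extracted, the sketch below samples the same pixel row from each MAP image and overlays the profiles; the row and column indices are placeholders, not values from the study.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_profiles(images, labels, row=120, cols=slice(40, 90)):
    """Overlay signal-intensity profiles taken along the same line of
    each image, to compare vessel number and apparent width."""
    for img, lab in zip(images, labels):
        plt.plot(np.asarray(img)[row, cols], label=lab)
    plt.xlabel("position (pixels)")
    plt.ylabel("PA signal (a.u.)")
    plt.legend()
    plt.show()
```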
All operations on Data 4 are the same as those on Data 5 (i.e., Ground truth 3 was reconstructed on another personal computer (PC) with 128 GB of memory), and the results are shown in Figure S3.4.

Analysis of the image reconstruction results at lower excitation laser dosages
The results for 2× under-sampled images obtained at 1/3 and 1/4 of the ANSI limit per pulse laser energy are shown in Figure S4. Thus, even at a very low SNR, our method improves the SNR by approximately 2× and reconstructs high-quality images.
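For reference, the sketch below shows one common way to quantify such an SNR gain (peak signal over background standard deviation, in dB); the exact metric used in the study is not reproduced here, so this definition is an assumption.

```python
import numpy as np

def snr_db(map_image, background_mask):
    """SNR as peak vessel signal over background noise level, in dB
    (assumed definition)."""
    signal = map_image.max()
    noise = map_image[background_mask].std()
    return 20.0 * np.log10(signal / noise)

# A ~2x amplitude SNR improvement corresponds to ~6 dB:
# 20 * log10(2) ~= 6.02 dB.
```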