In Vivo Intelligent Fluorescence Endo‐Microscopy by Varifocal Meta‐Device and Deep Learning

Abstract Endo-microscopy is crucial for real-time 3D visualization of internal tissues and subcellular structures. Conventional methods rely on axial movement of optical components for precise focus adjustment, which limits miniaturization and complicates procedures. Meta-devices, composed of artificial nanostructures, are emerging flat optical devices that can freely manipulate the phase and amplitude of light. Here, an intelligent fluorescence endo-microscope is developed based on a varifocal meta-lens and deep learning (DL). This breakthrough enables in vivo 3D imaging of mouse brains, in which the focal length of the varifocal meta-lens is adjusted through the relative rotation angle of its elements. The system offers key advantages such as invariant magnification, a large field-of-view, and optical sectioning over a maximum focal-length tuning range of ≈2 mm with 3 µm lateral resolution. Using a DL network, image acquisition time and system complexity are significantly reduced, and in vivo high-resolution brain images of detailed vessels and the surrounding perivascular space are clearly observed within 0.1 s (≈50 times faster). The approach will benefit various surgical procedures, such as gastrointestinal biopsies, neural imaging, and brain surgery.


This file includes:
Section 1: The design of GaN meta-atom for metasurface

Section 3: Telecentric configuration of the meta-varifocal endo-microscopy
Figure S3 shows the experimental setup and measurement of the telecentric configuration for the endo-microscope. Figure S3(a) shows the telecentric configuration of our endo-microscope. The measured focal length and NA of the endoscope at various Moiré metalens rotation angles are depicted in Figure S3(b). To verify the invariant-magnification property of our endo-microscope, a resolution chart is placed at different axial focal planes of the endoscope probe, as depicted in Figure S3(c). The resolution chart is shifted from the initial in-focus plane (z = 0 µm) to the z = 1500 µm plane in 300 µm steps, so the images change from sharp to blurred (top row of Figure S3(c)). By rotating the Moiré metalens through relative angles from 5° to 120°, each defocused image can be brought back into focus (bottom row of Figure S3(c)). The results show that the telecentric design keeps both the magnification and the FOV constant, enabling digital axial scanning for multi-plane imaging.

Section 4: Ray transfer matrix for the telecentric design
Ray transfer matrix calculation confirms the constant magnification of the telecentric design (1). In Figure S4, the focal length of the endoscope probe is fE. The 4-f relay lenses (R1 and R2) have focal lengths fR1 and fR2, and the focal lengths of the Moiré metalens and the tube lens are fMoire and fT, respectively. The ray transfer matrix for propagation (P) in a medium with a constant refractive index is given by

P = | 1  d |
    | 0  1 |,

where d is the separation distance between two reference surfaces along the optical axis. We assume that the optical components in Figure S4 are all thin lenses, which follow the lens transfer matrix (L)

L = |  1    0 |
    | -1/f  1 |,

where f is the focal length of the lens. The ray transfer matrix of the Moiré metalens based varifocal endo-microscopy (M) can then be determined as

M = MT · PT-Moire · MMoire · MR · ME,

where PT-Moire is the air-propagation matrix from the Moiré metalens to the tube lens. The entire transfer matrix M is divided into the endoscope probe matrix (ME), the 4-f relay lens matrix (MR), the Moiré metalens matrix (MMoire), and the tube lens matrix (MT), each built from the P and L matrices above. After multiplying all the matrices, the system magnification is given by the first element of the resulting matrix (M1,1), which does not depend on the focal length of the Moiré metalens (fMoire), satisfying the telecentric design. In addition, the axial position of the focusing beam (F) can be computed from the imaging condition, and the total axial displacement of the focusing beam (Δz) in the telecentric design is

Δz = FMax - FMin,

where FMax and FMin are the maximum and minimum axial positions of the focusing beam, given by fMoire,Min and fMoire,Max, respectively.

The high-spatial-frequency content (IHi) of the HiLo image is extracted by applying a Gaussian high-pass filter to the uniform-illumination image,

IHi(x,y) = GHPcut ⊗ Iuni(x,y),

where GHPcut is the two-dimensional Gaussian high-pass filter with a certain cut-off frequency (cut) and ⊗ denotes convolution.
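As a minimal numerical sketch of the argument above, the constancy of the magnification can be checked by multiplying ABCD matrices. The focal lengths here are hypothetical, and the geometry is an assumption: 4-f spacings between successive elements, with the Moiré metalens placed one tube-lens focal length before the tube lens.

```python
import numpy as np

def P(d):
    """Free-space propagation over distance d (ABCD matrix)."""
    return np.array([[1.0, d], [0.0, 1.0]])

def L(f):
    """Thin lens of focal length f (ABCD matrix)."""
    return np.array([[1.0, 0.0], [-1.0 / f, 1.0]])

def magnification_and_focus(f_E, f_R1, f_R2, f_M, f_T):
    """Compose object plane -> just after the tube lens (matrices multiply
    right-to-left along propagation). Returns (lateral magnification,
    image distance measured from the tube lens)."""
    M0 = (L(f_T) @ P(f_T) @ L(f_M) @ P(f_R2) @ L(f_R2)
          @ P(f_R1 + f_R2) @ L(f_R1) @ P(f_E + f_R1) @ L(f_E) @ P(f_E))
    A, B = M0[0]
    C, D = M0[1]
    z = -B / D            # imaging condition: B + z*D = 0
    return A + z * C, z   # magnification at the image plane

# Changing f_M (the Moire metalens focal length) shifts the focus z
# while leaving the magnification unchanged.
m1, z1 = magnification_and_focus(2.0, 15.0, 15.0, 500.0, 180.0)
m2, z2 = magnification_and_focus(2.0, 15.0, 15.0, 1000.0, 180.0)
```

With this arrangement the magnification reduces to fR1·fT/(fE·fR2), independent of fMoire, while the image plane moves axially as fMoire is tuned.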
On the other hand, the speckle-illumination image (Isp) can be decomposed into in-focus and out-of-focus contributions weighted by the modulation coefficient M(x,y) of the speckle illumination depicted in Figure S9.

Fig. S9. Speckle illumination image (Isp).
The in-focus area of the speckle-illumination image has the highest contrast, while the defocused area has low values. Therefore, the low-spatial-frequency content (ILo) of the HiLo image can be written as

ILo(x,y) = GLPcut ⊗ [LSC(x,y) · Isp(x,y)],

where GLPcut is the two-dimensional Gaussian low-pass filter with the same cut-off frequency as GHPcut. LSC(x,y) is the local spatial contrast of Isp, calculated as the local standard deviation divided by the local mean within an evaluation window. The scaling factor (sf, typical values in the range 0.5 to 3) balances the intensity distribution between IHi(x,y) and ILo(x,y); here, sf = 1 is used to obtain a high-contrast result.

The encoder down-samples the input image from 256×256 to 16×16 pixels. The kernel size of each convolutional layer in the encoder is 4×4, and the activation function of each convolutional layer is a leaky rectified linear unit (LReLU) with a slope of 0.2. In the decoder path, 9 convolutional layers (4×4 kernel size) and 4 transpose-convolution layers (blue arrows in Figure S15) up-sample and regress the image from 16×16 back to 256×256 pixels. The activation function of each convolutional layer in the decoder is the rectified linear unit (ReLU); for the last layer, tanh is applied. Between the encoder and decoder, skip connections concatenate the matching features of the down-sampling and up-sampling processes. The corresponding network details are listed in S2 and S3, respectively.
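The stated down-sampling from 256×256 to 16×16 pixels with 4×4 kernels is consistent with four stride-2 convolution stages (mirrored by the four transpose-convolution layers). A quick sanity check, assuming padding of 1 (a hypothetical but common choice, not stated in the text):

```python
def conv_out(n, k=4, s=2, p=1):
    """Spatial size after a strided convolution: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * p - k) // s + 1

def tconv_out(n, k=4, s=2, p=1):
    """Spatial size after a transpose convolution: (n - 1)*s - 2p + k."""
    return (n - 1) * s - 2 * p + k

# Four stride-2 encoder stages: 256 -> 128 -> 64 -> 32 -> 16
sizes = [256]
for _ in range(4):
    sizes.append(conv_out(sizes[-1]))
```

Each transpose-convolution stage inverts one encoder stage, e.g. 16 back to 32, so four of them restore the 256×256 output.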
The kernel size of each convolutional layer is set to 3×3 with stride (1,1), and a leaky rectified linear unit (LReLU) with a slope of 0.5 is chosen as the activation function. The shortcut connection is operated with identity shortcuts that use 3×3 convolutional filters of stride (1,1), so that the input and output have the same dimensions and can be added together. For the last layer, the tanh activation function is applied. Between the encoder and decoder, skip connections concatenate the matching features of the down-sampling and up-sampling processes. This passes low-level features from the down-sampling path directly to the high-level up-sampling layers, which is helpful for image transformation.

The performance of the RCNN model can be quantitatively assessed with two evaluation metrics (8). The first is the peak signal-to-noise ratio (PSNR), which quantifies the reconstruction quality of the model. PSNR is defined by the difference between two images and can be computed as

PSNR = 10 · log10( MAXI² / MSE ),

where MAXI is the maximum value of the image. MSE is the mean squared error,

MSE = (1 / (X·Y)) · Σx Σy [ Im(x,y) - IHiLo(x,y) ]²,

where Im is the input (Iuni) or model-predicted image (ĨHiLo). X and Y represent the width and height of one sectioning image, and x and y are the spatial pixel coordinates in each sectioning image. A higher PSNR means that the images predicted by the RCNN model have image quality comparable to the ground-truth images (IHiLo).
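The PSNR and MSE definitions above translate directly into NumPy; MAXI = 255 is assumed here for 8-bit images:

```python
import numpy as np

def mse(im, ref):
    """Mean squared error averaged over one X-by-Y sectioning image."""
    return np.mean((im.astype(float) - ref.astype(float)) ** 2)

def psnr(im, ref, max_i=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAXI^2 / MSE)."""
    m = mse(im, ref)
    return float("inf") if m == 0 else 10.0 * np.log10(max_i ** 2 / m)
```

Identical images give infinite PSNR, while images differing everywhere by the full dynamic range give 0 dB.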
Structural similarity (SSIM) is the second evaluation metric; it quantifies the similarity between the model prediction and the ground-truth images. SSIM can be calculated as

SSIM(Im1, Im2) = lu(Im1, Im2)^α · cn(Im1, Im2)^β · st(Im1, Im2)^γ,

where Im1 and Im2 are the two images to be compared: Im1 represents the input or model-predicted image and Im2 is the ground-truth image. lu, cn, and st are the luminance, contrast, and structure terms, respectively, and α, β, and γ are the corresponding weighting factors, all set to unity here.
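A single-window sketch of the SSIM product above; the stabilizing constants C1, C2, and C3 = C2/2 follow the common convention and are an assumption, not stated in the text:

```python
import numpy as np

def ssim(im1, im2, max_i=255.0, alpha=1.0, beta=1.0, gamma=1.0):
    """Global (single-window) SSIM = lu^alpha * cn^beta * st^gamma."""
    c1 = (0.01 * max_i) ** 2
    c2 = (0.03 * max_i) ** 2
    c3 = c2 / 2.0
    mu1, mu2 = im1.mean(), im2.mean()
    s1, s2 = im1.std(), im2.std()
    cov = ((im1 - mu1) * (im2 - mu2)).mean()
    lu = (2 * mu1 * mu2 + c1) / (mu1 ** 2 + mu2 ** 2 + c1)  # luminance
    cn = (2 * s1 * s2 + c2) / (s1 ** 2 + s2 ** 2 + c2)      # contrast
    st = (cov + c3) / (s1 * s2 + c3)                        # structure
    return lu ** alpha * cn ** beta * st ** gamma
```

Practical SSIM implementations evaluate these terms in local sliding windows and average; the global version shown here keeps the sketch short.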
Table S4 lists the PSNR and SSIM values of the input and predicted images from the validation dataset. Compared with the input images, the average PSNR and SSIM of the RCNN-predicted images are improved by ≈14 dB and 4.5 times, respectively. From Table S4, both PSNR and SSIM confirm this improvement.

Section 2:
Fig. S1. GaN meta-atom specification. (a) Meta-atom for the Moiré metalens, consisting of a GaN nanocylinder on an Al2O3 sapphire substrate with 800 nm height (H) and 300 nm period (P). (b) Simulated transmission and phase spectra for different meta-atom diameters.

Fig. S3. Experimental setup and measurement of the Moiré metalens based varifocal endo-microscopy. (a) The telecentric setup of the endo-microscopy, which includes a miniature objective lens, rod achromatic doublet relay, Moiré metalens, and tube lens. (b) Experimental results of the focal length and the corresponding NA of the endoscope at different Moiré metalens rotation angles. (c) Resolution chart imaged at different axial positions, before (top row) and after (bottom row) refocusing by rotating the Moiré metalens.

Figure S8 shows the Iuni image of mouse brain perivascular spaces. The high-spatial-frequency content (IHi) of the HiLo optical-sectioned image can be directly extracted from Iuni. Figure S10 shows the HiLo optical-sectioned image (IHiLo(x,y)). IHiLo(x,y) is generated by merging the high-spatial-frequency content (IHi) and the low-spatial-frequency content (ILo) to obtain an in-focus image over the entire spatial-frequency range, and it can be expressed as

IHiLo(x,y) = IHi(x,y) + sf · ILo(x,y).
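The HiLo composition, a high-pass of the uniform-illumination image plus a contrast-weighted, low-pass-filtered speckle image, can be sketched with NumPy. The cut-off frequency, contrast-window size, and frequency-domain filter construction here are illustrative choices, not the authors' exact parameters:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def gaussian_lowpass(img, cut=0.05):
    """GLP_cut: 2-D Gaussian low-pass filter applied in the frequency domain."""
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    glp = np.exp(-(fx ** 2 + fy ** 2) / (2.0 * cut ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * glp))

def local_contrast(img, win=7):
    """LSC: local standard deviation over local mean in a win x win window."""
    w = sliding_window_view(np.pad(img, win // 2, mode="reflect"), (win, win))
    return w.std(axis=(-1, -2)) / (w.mean(axis=(-1, -2)) + 1e-12)

def hilo(i_uni, i_sp, cut=0.05, sf=1.0):
    """I_HiLo = I_Hi + sf * I_Lo: high-pass of I_uni (GHP = 1 - GLP, same
    cut-off) plus low-pass of the contrast-weighted speckle image."""
    i_hi = i_uni - gaussian_lowpass(i_uni, cut)
    i_lo = gaussian_lowpass(local_contrast(i_sp) * i_sp, cut)
    return i_hi + sf * i_lo
```

Implementing the high-pass as the complement of the low-pass guarantees the two filters share the same cut-off frequency, matching the requirement that GHPcut and GLPcut be matched.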

Section 8:
Fig. S11. Varifocal optical-sectioning endo-microscopy calibration. (a) Normalized intensity distribution of fluorescent microspheres along the axial direction. Blue and red data points are measured using uniform illumination and the HiLo imaging process, respectively (dots: measured raw data; curve: Gaussian curve fitting). (b) Two different depths of ex-vivo transparent mouse brain tissue (250 µm thickness) tagged with Alexa Fluor™ 555 in both wide-field (i.e. Iuni) and HiLo (IHiLo) imaging results. (c) 3D reconstructed volume images of various perivascular-space locations in the mouse brain under uniform illumination and the HiLo process (each 3D image is 750 µm × 750 µm × 1 mm).

Fig. S13. Comparison results of Alexa Fluor 488-labeled ex-vivo transparent mouse brain tissue. (a) Wide-field images, (b) HiLo optical-sectioning images, and (c) comparison of normalized intensity profiles.
In addition, a dropout layer and a batch normalization layer are applied to the model, which stabilize the training process and improve performance. With the help of the shortcut-connection operations, the model becomes the residual counterpart of its inputs, which mitigates the vanishing/exploding-gradient and degradation issues of conventional convolutional neural networks.
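The identity-shortcut idea can be sketched in NumPy: the block output is F(x) + x, so the signal (and, during training, the gradient) has a path that bypasses the transform branch. The 3×3 kernels here are hypothetical placeholders, and the LReLU slope of 0.5 matches the value stated above; this is not the trained RCNN itself:

```python
import numpy as np

def conv3x3_same(x, k):
    """Naive 3x3 'same' convolution with stride (1,1) and zero padding."""
    pad = np.pad(x, 1, mode="constant")
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * k)
    return out

def leaky_relu(x, slope=0.5):
    """LReLU activation with the slope used in the residual blocks."""
    return np.where(x > 0, x, slope * x)

def residual_block(x, k1, k2):
    """y = LReLU(F(x) + x): two conv stages plus the identity shortcut."""
    f = leaky_relu(conv3x3_same(x, k1))
    f = conv3x3_same(f, k2)
    return leaky_relu(f + x)
```

If the transform branch learns to output zero, the block reduces to the identity on non-negative inputs, which is exactly the property that eases optimization in deep residual networks.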

Fig. S19. Comparison of the inputs, ground truths, and predictions from the training dataset. (a) Resultant images of five different types of fluorescent samples, including ex-vivo mouse brain tissue, fluorescent beads, corn stem, lily anther, and basswood stem, taken by

Fig. S20. Image comparison results for the ex-vivo mouse brain. a, Iuni, IHiLo, and ĨHiLo fluorescent images at three different depths, with the corresponding Moiré metalens rotation angles. b, Absolute error maps in the left column show the difference between Iuni and IHiLo, while absolute error maps in the right column show the difference between ĨHiLo and IHiLo.

Fig. S21. In-vivo fluorescence images of the right side of the mouse brain. a, In-vivo wide-field images, ground truths, and model predictions at three different depths. b, Absolute error maps in the left column show the difference between Iuni and IHiLo, while absolute error maps in the right column show the difference between ĨHiLo and IHiLo.

Fig. S22. In-vivo fluorescence images of the left side of the mouse brain. a, In-vivo wide-field images, ground truths, and model predictions at three different depths. b, Absolute error maps in the left column show the difference between Iuni and IHiLo, while absolute error maps in the right column show the difference between ĨHiLo and IHiLo.