Generating Multi‐Depth 3D Holograms Using a Fully Convolutional Neural Network

Abstract Efficiently generating 3D holograms is one of the most challenging research topics in the field of holography. This work introduces a method for generating multi‐depth phase‐only holograms using a fully convolutional neural network (FCN). The method primarily involves a forward–backward‐diffraction framework to compute multi‐depth diffraction fields, along with a layer‐by‐layer replacement method (L2RM) to handle occlusion relationships. The diffraction fields computed by the former are fed into the carefully designed FCN, which leverages its powerful non‐linear fitting capability to generate multi‐depth holograms of 3D scenes. The latter can smooth the boundaries of different layers in scene reconstruction by complementing information of occluded objects, thus enhancing the reconstruction quality of holograms. The proposed method can generate a multi‐depth 3D hologram with a PSNR of 31.8 dB in just 90 ms for a resolution of 2160 × 3840 on the NVIDIA Tesla A100 40G tensor core GPU. Additionally, numerical and experimental results indicate that the generated holograms accurately reconstruct clear 3D scenes with correct occlusion relationships and provide excellent depth focusing.


Introduction
Compared to other 3D display technologies, such as head-mounted displays, [1] autostereoscopic displays, [2] and volumetric computing, [32,33] optical encryption, [34,35] and angular momentum holography. [36,37] According to the basic element representation used in CGHs, several kinds of algorithms have been developed to generate CGHs, based on point clouds, [38] lines, [39] curves, [40] polygons, [41] layers, [42] ray tracing, [43] and so on. However, the nature of computer computation causes CGHs to suffer from a compromise between reconstructing a more realistic scene and achieving greater computational efficiency. This has often been referred to as the "computational bottleneck." [44] The non-local many-to-many mapping of the diffraction integral gives rise to tremendous computation along with massive memory access and updates, which becomes more serious as the resolution of the CGH increases. Meanwhile, accurately modeling light propagation requires a high-bit-depth complex-valued representation, which is extremely time-consuming. Additional complex-amplitude encoding will also require extra time if the CGH is loaded on a phase-modulated SLM. Some acceleration methods have been developed to speed up the calculation of CGHs, such as the look-up table, [45] the wavefront-recording-plane method, [43] stereogram approximation, [46] Fourier-domain sparsity, [47] and the wavelet transform. [48] Besides algorithm development, hardware acceleration has also been pursued on CPUs, [49] GPUs, [49] and FPGAs. [11]

The generation of a CGH can be viewed as an inverse problem. Optimization algorithms, such as iterative phase-retrieval algorithms [50][51][52] or (stochastic) gradient descent, [13,16] are employed to solve this problem. These algorithms iteratively seek a phase-only pattern that produces a propagated wavefront matching the desired amplitude distribution. Recently, as a rapidly developing tool for optimization, a machine learning approach called the deep neural network (DNN) has been employed as a new paradigm for solving inverse problems in various fields of imaging, photonics, [53][54][55] and holography. [56,57] In CGH, the DNN is used to solve the complex non-linear optimization process. The differentiable nature of wave propagation and the maturity of differentiable software infrastructures have nurtured learning-based CGH algorithms that address the high computational cost. [56] The DNN also benefits from the flexibility of modifying the network architecture, tuning network parameters, improving the training algorithms, and optimizing the hardware systems. Deep learning techniques have been employed to generate high-quality holographic images, and a multilevel loss function has been proposed to optimize the training process of wave propagation models. [58] Compared to planar holograms, 3D holograms require focusing at different depths, making them more computationally demanding and harder to generate. Traditional methods for 3D CGH generation involve complex algorithms and significant computational resources. In contrast, the DeepCGH approach leverages the power of deep learning to simplify and optimize the hologram generation process. [59] Furthermore, a novel deep neural network architecture and training strategy for multi-depth hologram generation has been presented. [60]

It can advance holography techniques by harnessing the power of deep learning, thereby opening up new possibilities for realistic and efficient 3D imaging applications. Some holograms can also be synthesized by optimizing a wave field to reconstruct multiple varifocal images. [61] They are rendered through a physically-based renderer, providing real-world-like defocus blur and achieving photorealistic reconstruction. A large-scale CGH dataset with 4000 pairs of RGB-depth images and corresponding 3D holograms has been designed to train a deep-learning-based CGH pipeline, which can generate photorealistic color 3D holograms from a single RGB-depth image in real time. [5] Multi-depth holograms can reconstruct 3D scenes focused at different depth planes and demonstrate excellent defocus blur when the depth information does not match the reconstruction distance. The generation of holograms poses an ill-posed inverse problem, [62,63] and the generation of multi-depth holograms presents an even greater challenge. Despite attempts to improve them, methods commonly used for generating 2D holograms, such as stochastic gradient descent and Gerchberg-Saxton, prove mediocre when applied to multi-depth holograms, producing unsatisfactory outcomes. [64,65] Furthermore, as a result of inadequate network architectures and improper handling of occlusion relationships, some deep-learning-based methods yield subpar scene clarity and fall short of achieving accurate occlusion culling in the reconstructed scenes. [60,66]

Here, to the best of our knowledge, we propose for the first time a forward-backward-diffraction framework to compute multi-depth diffraction fields and combine it with a carefully designed fully convolutional neural network (FCN) to generate multi-depth holograms. This framework implements occlusion of the foreground on the background's diffraction field during the forward propagation of the diffraction field and determines the reconstruction distance through the backward propagation of the diffraction field. In addition, we utilize 3D modeling software to acquire 3D graphical datasets and introduce a multi-depth loss function to improve the fitting performance of the FCN. During the reconstruction of a 3D scene, the absence of information from occluded objects leads to prominent dark boundaries between different layers, compromising the visual continuity of the 3D scene. To address this issue, we propose the layer-by-layer replacement method (L2RM) to smooth the boundaries between different layers and handle the occlusion relationships among them. Finally, we validate the effectiveness of the generated multi-depth holograms through numerical and optical experiments, showcasing their high reconstruction quality and excellent defocus blur.

Computation of Multi-Depth Diffraction Field
Designing a 3D scene and obtaining its multi-depth diffraction field are essential prerequisites for generating multi-depth holograms. This section outlines the acquisition of 3D graphical datasets using modeling software and proposes a forward-backward-diffraction framework to compute the multi-depth diffraction field.

3D Graphical Datasets
In the process of generating a 3D CGH with neural networks, high-quality 3D graphical training datasets are often scarce. Traditionally, researchers typically use depth cameras to record real 3D scenes to obtain training datasets. [67] However, this approach is costly and operationally complicated. For example, time-of-flight depth cameras [68,69] consist of components such as light sources, optical elements, sensors, control circuitry, and processing circuitry. These cameras are expensive, and using them to sample 3D scenes is easily affected by multiple reflections.
The 3D modeling software "Blender" is used to flexibly and conveniently sample 3D scenes for generating 3D graphical datasets. Compared to real cameras, sampling with virtual cameras introduces no barrel or pincushion distortion in the images. As shown in Figure 1A, various models of fruits and vegetables are combined randomly into a 3D scene. In this scene, each model has a random size, location, and rotation angle. A virtual camera captures the 3D scene to obtain the intensity and depth images that make up the 3D dataset, as shown in Figure 1B. The resolution of the training dataset is set to 2160 × 3840 during the sampling process. We positioned the virtual camera far from the 3D scene and set its focal length to 350 mm, which helps avoid any distortion of model size caused by perspective. Before each sample, the 3D scene is updated with a new random combination of models, so each intensity image and its corresponding depth image in the training dataset is unique, as shown in Figure 1C,D. Their values are distributed within the range of 0-1. We sampled 900 randomly generated 3D scenes, assigning 800 sets of images to the training dataset and 100 sets to the test dataset.

Forward-Backward-Diffraction Framework
The calculation process of the multi-depth diffraction field is demonstrated using Figure 2A as an example. The intensity information I of the 3D scene is divided into six sample slices, as shown in Figure 2A, with the same depth interval of Δz = 5 mm. According to the propagation sequence of the diffraction field in the reconstruction process, these slices are located at the depth planes (i) of z_i = 10 + (i − 1) × 5 mm (i = 1, 2, ⋅⋅⋅, 6), and the hologram plane is located at z_0 = 0 mm. We set the initial phase of the slices to 0. The sample slice's wave field √I_1 in depth plane (1) propagates forward to depth plane (2), and the first diffraction field U_1 is obtained. The superposition of the sample slice's wave field √I_2 in depth plane (2) and the first diffraction field U_1 propagates forward to depth plane (3), and the second diffraction field U_2 is obtained. By analogy, the superposition of the sample slice's wave field √I_5 in depth plane (5) and the fourth diffraction field U_4 propagates forward to depth plane (6), and the fifth diffraction field U_5 is obtained. Finally, the superposition of the sample slice's wave field √I_6 in depth plane (6) and the fifth diffraction field U_5 propagates backward to the hologram plane, and the sixth diffraction field (the multi-depth diffraction field) U_6 is obtained. It should be noted that in the forward propagation of the diffraction field, directly adding the sample slice to the previous diffraction field is inaccurate. Instead, a mask is needed to remove the light field corresponding to the sample slice from the diffraction field before adding the sample slice. This simulates the occlusion effect that occurs when the light field of a 3D scene propagates in reality. In this paper, we use the symbol "⊕" to represent this operation.
Without loss of generality, the above process can be extended to the general situation. Suppose the scene is divided into N sample slices. According to the angular spectrum theory, [70] the complex amplitude of the n-th diffraction field U_n(x, y) can be expressed using the angular spectrum method (ASM) as

U_n(x, y) = F^{-1}{ F{ √I_n(x, y) ⊕ U_{n−1}(x, y) } · H_z(f_x, f_y) }    (1)

where z = Δz for the forward steps and z = −z_N for the final backward step to the hologram plane, F{•} is the 2D Fourier transform, F^{-1}{•} is the 2D inverse Fourier transform, U_0(x, y) = 0, and I_n(x, y) is the intensity distribution of the scene in the n-th sample slice. The operation "⊕" is expressed as

√I_n(x, y) ⊕ U_{n−1}(x, y) = √I_n(x, y) + M_n(x, y) · U_{n−1}(x, y)    (2)

M_n(x, y) = 0 where d(x, y) falls on the n-th depth plane, and M_n(x, y) = 1 otherwise    (3)

where M_n(x, y) is the mask corresponding to the n-th sample slice and d(x, y) is the discrete depth distribution of the 3D scene.
This method for handling occlusion relationships will be discussed in detail in Section 4.1. f_x and f_y are the spatial frequencies of the diffraction field along the x-axis and y-axis, respectively. H_z(f_x, f_y) is the optical transfer function (OTF) of the diffraction-field propagation, expressed as

H_z(f_x, f_y) = exp[ j (2π/λ) z √(1 − (λ f_x)² − (λ f_y)²) ]    (4)

where j is the imaginary unit (j² = −1), λ is the wavelength of the laser used, and Δp is the pixel interval of the SLM, which sets the sampling interval of f_x and f_y.
As described above, we refer to this diffraction-field propagation model as the "forward-backward-diffraction framework": it first utilizes forward propagation to address the occlusion relationship between each layer and the diffraction field, and then employs backward propagation to determine the scene's diffraction distance. In contrast to the traditional multi-depth method based on a series of backward diffractions, this framework can handle foreground occlusion of the background's diffraction field, thereby eliminating excess diffraction fields and avoiding unwanted plane-wave components interfering with the reconstructed scene. This "forward diffraction + backward diffraction" working mode can both correctly handle occlusion relationships and achieve accurate depth focusing. Ultimately, through the coordination of this forward-backward-diffraction framework and occlusion processing, a multi-depth diffraction field U_N(x, y) of the 3D scene at the plane of the SLM is obtained.
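As a concrete illustration, the forward-backward loop described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code; the helper names and the example parameters (wavelength, pixel pitch) are our assumptions:

```python
import numpy as np

def asm_transfer(shape, z, wl, dp):
    # Angular-spectrum OTF H_z; dp is the SLM pixel pitch.
    ny, nx = shape
    fx, fy = np.meshgrid(np.fft.fftfreq(nx, d=dp), np.fft.fftfreq(ny, d=dp))
    arg = 1.0 - (wl * fx) ** 2 - (wl * fy) ** 2
    kz = (2 * np.pi / wl) * np.sqrt(np.maximum(arg, 0.0))
    return np.exp(1j * kz * z) * (arg > 0)  # suppress evanescent components

def propagate(u, z, wl, dp):
    # Free-space propagation over distance z via the ASM.
    return np.fft.ifft2(np.fft.fft2(u) * asm_transfer(u.shape, z, wl, dp))

def forward_backward_field(intensities, masks, dz, z_last, wl, dp):
    """intensities[n]: slice intensity I_n; masks[n]: 0 where slice n occludes.
    Forward-propagate plane by plane with occlusion, then propagate the final
    superposition backward to the hologram plane at z = 0."""
    u = np.zeros_like(intensities[0], dtype=complex)
    for I_n, M_n in zip(intensities[:-1], masks[:-1]):
        u = np.sqrt(I_n) + M_n * u        # the "⊕" operation
        u = propagate(u, dz, wl, dp)      # forward to the next depth plane
    u = np.sqrt(intensities[-1]) + masks[-1] * u
    return propagate(u, -z_last, wl, dp)  # backward to the hologram plane
```

For the six-slice example above, one would call `forward_backward_field` with `dz = 5e-3`, `z_last = 35e-3`, a laser wavelength such as 532 nm, and the pixel pitch of the SLM.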

Generation and Optimization of Multi-Depth Holograms
After the multi-depth diffraction fields of the 3D scene dataset are prepared, a fully convolutional neural network must be constructed to generate multi-depth holograms. Furthermore, it is crucial to calculate the multi-depth errors between the reconstructed images at various diffraction distances from the hologram and the target scene, in order to train the network and enhance the reconstruction quality of the holograms.

Network Structure
A U-Net-based FCN is designed to generate multi-depth holograms in this work. The deep convolutional layers of U-Net allow it to effectively learn intricate features and preserve fine details of diffraction fields. With its ability to capture both local and global context, U-Net generates high-quality, realistic holographic outputs while requiring limited training data. The details of the FCN are shown in Figure 2B. The multi-depth diffraction field U_N(x, y) is decomposed into its real part U_R(x, y) and imaginary part U_I(x, y), which serve as the input of the FCN, while the phase-only hologram Φ(x, y) is the output. To reduce the edge fringes of the diffraction fields resulting from diffraction, the resolution is expanded from 2160 × 3840 to 2304 × 4096 by zero padding. The "phase recombination" method is used to optimize the structure of the input and output layers to reduce the memory footprint and increase generation speed, [71] so the 2304 × 4096 × 2 input is converted to a 1152 × 2048 × 8 matrix, and the input layer of the network has eight channels. In processing the input, multiple 3 × 3 × 8 convolutional kernels convolve the input data with a stride of one. This operation restructures the real and imaginary parts and links them together. In addition to the input layer, the FCN includes two down-sampling layers, two up-sampling layers, a middle layer, an output layer, and skip connections.
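The 2304 × 4096 × 2 → 1152 × 2048 × 8 conversion described above is a space-to-depth rearrangement. The following hypothetical helpers (our names, not the paper's) sketch this reshaping and its inverse in NumPy:

```python
import numpy as np

def pixel_unshuffle(x, r=2):
    """Space-to-depth: (H, W, C) -> (H/r, W/r, C*r*r).
    Each r x r spatial patch is folded into the channel dimension,
    mirroring the "phase recombination" input reshaping described above."""
    h, w, c = x.shape
    x = x.reshape(h // r, r, w // r, r, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h // r, w // r, c * r * r)

def pixel_shuffle(x, r=2):
    """Inverse depth-to-space: (H, W, C*r*r) -> (H*r, W*r, C)."""
    h, w, c = x.shape
    x = x.reshape(h, w, r, r, c // (r * r))
    return x.transpose(0, 2, 1, 3, 4).reshape(h * r, w * r, c // (r * r))
```

With `r = 2`, a 2304 × 4096 × 2 real/imaginary stack becomes the 1152 × 2048 × 8 network input, and `pixel_shuffle` performs the inverse step used to recombine the network output into a full-resolution phase map.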
As mentioned above, generating holograms with the FCN involves two down-sampling and two up-sampling steps. In the first down-sampling step, the feature of the highest down-sampling layer is extracted from the input using two 3 × 3 convolutions and two ReLU functions. This feature becomes the input of the second down-sampling layer after a 2 × 2 max-pooling operation. The feature at the second down-sampling layer is obtained in a similar way and serves as the input to the middle layer. After a 2 × 2 up-convolution operation, the second up-sampling layer takes the feature extracted from the middle layer as input. Similarly, the highest up-sampling layer takes the feature extracted from the second up-sampling layer as input. Finally, the output of the FCN is obtained from the highest up-sampling layer's feature through a 1 × 1 convolution.
In general, down-sampling compresses the input information, while up-sampling restores it. During down-sampling, the resolution of the feature maps is reduced while the number of channels is increased; conversely, during up-sampling the resolution is increased and the number of channels is decreased. The features extracted during down-sampling are concatenated with those extracted during up-sampling via skip connections in order to retain the high-resolution details contained in the early-layer features. The multi-depth hologram Φ(x, y) is obtained from the output of the FCN through "phase recombination."
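A network of this shape can be sketched in PyTorch. Only the overall topology (pixel-unshuffled 8-channel input, two down-/up-sampling steps with skip connections, 1 × 1 output convolution followed by the inverse recombination) follows the description above; the channel widths and the final tanh phase constraint are our illustrative assumptions:

```python
import torch
import torch.nn as nn

def block(cin, cout):
    # Two 3x3 convolutions, each followed by a ReLU, as in each sampling layer.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class HoloFCN(nn.Module):
    """Minimal U-Net sketch with two down-/up-sampling steps (hypothetical
    widths, not the authors' exact configuration)."""
    def __init__(self, ch=8, width=32):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(2)   # 2-ch field -> 8-ch, half res
        self.down1 = block(ch, width)
        self.down2 = block(width, width * 2)
        self.mid = block(width * 2, width * 4)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(width * 4, width * 2, 2, stride=2)
        self.dec2 = block(width * 4, width * 2)
        self.up1 = nn.ConvTranspose2d(width * 2, width, 2, stride=2)
        self.dec1 = block(width * 2, width)
        self.out = nn.Conv2d(width, 4, 1)       # 4 ch -> 1-ch phase after shuffle
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, field):                   # field: (B, 2, H, W) real+imag
        x = self.unshuffle(field)               # (B, 8, H/2, W/2)
        d1 = self.down1(x)
        d2 = self.down2(self.pool(d1))
        m = self.mid(self.pool(d2))
        u2 = self.dec2(torch.cat([self.up2(m), d2], dim=1))   # skip connection
        u1 = self.dec1(torch.cat([self.up1(u2), d1], dim=1))  # skip connection
        phase = self.shuffle(self.out(u1))      # (B, 1, H, W)
        return torch.pi * torch.tanh(phase)     # constrain phase to (-pi, pi)
```

A forward pass on a 2-channel diffraction field of any resolution divisible by 8 yields a same-resolution single-channel phase map.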

Training Methods
The designed FCN implements the coding operation U_N(x, y) → Φ(x, y) in a complex and implicit way. However, before training the network model, the relationship between the input and output is unclear, and the holograms' reconstructed 3D scenes perform poorly. Therefore, the errors between the target and reconstructed scenes need to be well defined to update the network's node parameters, and the back-propagation algorithm is employed. When generating 2D holograms with neural networks, only the error at one depth plane needs to be calculated. However, a multi-depth hologram reconstructs 3D scenes on different depth planes, so the multi-depth errors must be calculated for each depth plane. An L1 (mean absolute error) loss function is adopted to calculate the multi-depth error between the reconstructed and target scenes, defined as

Loss = (1/(XY)) Σ_{x,y} | √I(x, y) − |U′(x, y)| |    (5)

where X and Y are the pixel counts along the two axes, I(x, y) is the overall intensity distribution of the 3D scene, and U′(x, y) is the recombination of the reconstructed diffraction fields. Equation (5) deviates from commonly used normalization methods by aligning √I(x, y) and U′(x, y) in the energy domain. This effectively reduces the brightness disparity caused by high-value noise pixels, thereby preventing loss oscillation and facilitating network convergence.
U′_n(x, y) is the complex amplitude of the diffraction field reconstructed at the n-th depth plane by the multi-depth hologram Φ(x, y), expressed as

U′_n(x, y) = F^{-1}{ F{ exp[jΦ(x, y)] } · H_{z_n}(f_x, f_y) }    (6)

Figure 2C shows the above calculation process of the multi-depth error. After obtaining the reconstructions of the multi-depth hologram at all depths using the ASM, the focused parts of each reconstruction are extracted and recombined into a reconstructed scene. Finally, the loss between this reconstructed scene and the target scene is calculated to update the parameters of the network model.
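The reconstruct-extract-recombine loop for the multi-depth error can be sketched as follows. This is an illustrative NumPy sketch only (a training implementation would use an autograd framework); the function names, default wavelength, and pixel pitch are our assumptions:

```python
import numpy as np

def propagate(u, z, wl, dp):
    # ASM propagation of a complex field u over distance z.
    ny, nx = u.shape
    fx, fy = np.meshgrid(np.fft.fftfreq(nx, dp), np.fft.fftfreq(ny, dp))
    arg = 1.0 - (wl * fx) ** 2 - (wl * fy) ** 2
    H = np.exp(1j * (2 * np.pi / wl) * z * np.sqrt(np.maximum(arg, 0))) * (arg > 0)
    return np.fft.ifft2(np.fft.fft2(u) * H)

def multi_depth_l1(phi, target_I, depth_idx, zs, wl=532e-9, dp=8e-6):
    """Reconstruct the phase-only hologram phi at every depth plane, keep the
    in-focus pixels of each plane (depth_idx labels which plane each pixel
    belongs to), recombine them, and compare amplitudes with an L1 loss."""
    recombined = np.zeros_like(target_I)
    for n, z in enumerate(zs):
        amp = np.abs(propagate(np.exp(1j * phi), z, wl, dp))
        recombined[depth_idx == n] = amp[depth_idx == n]   # extract focused part
    return np.mean(np.abs(np.sqrt(target_I) - recombined))
```

The scalar it returns plays the role of the multi-depth error that back-propagation would minimize.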

Enhancement of Occlusion Effects
Occlusion is a critical visual cue in 3D displays: the light field of an object far from the observer is occluded by closer objects as the light propagates forward. If occlusion is not effectively addressed, the reconstruction quality of the 3D scene deteriorates.

Standard Method
As shown in Figure 2A, the forward-backward-diffraction framework presented in Section 2 attains occlusion effects by applying a mask that hard-blocks the diffraction field. In this paper, we refer to this conventional approach to occlusion handling as the "standard method." Additionally, we have designed a 3D scene featuring "pokers" with four layers to illustrate the standard method in detail. As illustrated in Figure 3A, this 3D scene comprises four layers, arranged from first to fourth as follows: the tablecloth, the 10 of clubs, the 9 of diamonds, and the 8 of spades. Figure 3B presents the sampling results of the 3D scene, featuring both the intensity and depth images. In the depth image, the depth of each layer is proportional to the gray value. Figure 3C shows the sample slices and their corresponding "0-1" masks, which are calculated from Equation (3). In the n-th mask, "0" means that the n-th slice occludes the (n − 1)-th diffraction field at the corresponding position, while "1" means that it does not.
The standard method for handling occlusion relationships, implemented via Equation (2), is shown in Figure 3D. First, the diffraction field following the propagation of the first slice to depth plane (2) is computed using the ASM. As this diffraction field results from the propagation of the first slice, it is referred to as the first diffraction field. Subsequently, the first diffraction field is multiplied by the second mask, derived from the second slice, to obtain the occluded first diffraction field, achieving the occlusion effect of the second slice on the first slice. The occluded first diffraction field and the second slice are then superimposed and propagated to depth plane (3) to obtain the second diffraction field. Subsequent steps follow the same pattern: the second diffraction field is multiplied by the third mask, derived from the third slice, to achieve the occlusion effect of the third slice on the first and second slices; the superposition of the occluded second diffraction field and the third slice then continues to propagate, and so forth.
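The slices and "0-1" masks used above can be derived from the sampled intensity and depth images. The following sketch (our hypothetical helper, with a simple uniform depth quantization assumed) shows one way to build the n-th slice and its mask:

```python
import numpy as np

def slice_and_mask(intensity, depth, n, n_planes):
    """Quantize a [0, 1] depth map into n_planes layers and build the n-th
    sample slice I_n and its 0-1 occlusion mask M_n: 0 where slice n is
    present (it occludes the field behind it), 1 elsewhere."""
    d_idx = np.clip((depth * n_planes).astype(int), 0, n_planes - 1)
    layer = (d_idx == n)
    I_n = np.where(layer, intensity, 0.0)   # intensity of this layer only
    M_n = np.where(layer, 0.0, 1.0)         # 0 blocks the previous field
    return I_n, M_n
```

Multiplying the previous diffraction field by `M_n` before adding `sqrt(I_n)` reproduces the hard-blocking step of the standard method.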

Layer-by-Layer Replacement Method
While employing the standard method to process occlusion relationships, the absence of information about occluded objects can produce distinct dark fringes at the boundaries of different layers, a phenomenon known as the "diffraction missing phenomenon," clearly visible in Figure 6. This phenomenon is related to the initial phase of the slices. When calculating the diffraction field of the slices, a random distribution of initial phases leads to severe speckle noise in the diffraction field; [72] therefore, we set the initial phase of the slices to 0. However, waves with the same initial phase produce coherent superposition, resulting in fringes at the edges of the diffraction field. The absence of information from occluded objects makes the boundaries between slices equivalent to the edges of the light field, leading to similar fringes at the corresponding positions in the diffraction field. The dark line closely adjacent to the boundaries of different slices in Figure 6 is the most prominent of these dark fringes. Such fringes disrupt visual continuity and reduce the reconstruction quality of the 3D scene. Figure 4A depicts the disruption of visual continuity caused by this dark fringe, where the dark purple rectangles represent real pixels of the (n − 1)-th sample slice and the bright purple rectangles represent real pixels of the n-th sample slice.
In this paper, the layer-by-layer replacement method (L2RM) is proposed to mitigate the diffraction missing phenomenon in multi-depth diffraction fields and in the reconstructed scenes of multi-depth holograms. As shown in Figure 4B, the first step replaces the occluded pixels in the (n − 1)-th sample slice with the known pixels at the corresponding positions in the n-th sample slice or in a higher-level sample slice. Next, we compute the diffraction field of the (n − 1)-th sample slice and again replace the pixels in the (n − 1)-th diffraction field with those at the corresponding positions of the n-th layer. Because the missing pixels are filled, the boundaries of the computed diffraction field become smoother and can be seen as a continuation of the n-th sample slice.
The process of utilizing L2RM to enhance occlusion effects is shown in Figure 4C. The intensity image propagates forward, generating the first diffraction field. Subsequently, the second mask occludes the first diffraction field, and the superposition of the latter with the second sample slice propagates forward, yielding the second diffraction field. By analogy, after each diffraction step, the superposition of the diffraction field and the sample slice is replaced by the next sample slice at the corresponding positions. This calculation process can be expressed as

U_n(x, y) = M_{n+1}(x, y) · F^{-1}{ F{ √I_n(x, y) ⊕ U_{n−1}(x, y) } · H_{Δz}(f_x, f_y) } + [1 − M_{n+1}(x, y)] · √I_{n+1}(x, y)    (7)

This iterative process results in a multi-depth diffraction field free from the diffraction missing phenomenon. When this field is fed into the designed FCN, the generated multi-depth hologram's reconstructed scenes likewise exhibit no diffraction missing phenomenon.
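One L2RM iteration, occlude, add the current slice, propagate, then overwrite the pixels occupied by the next slice with that slice's amplitude, can be sketched as follows. This is our reading of the description above, not the authors' code; the default wavelength and pixel pitch are assumptions:

```python
import numpy as np

def propagate(u, z, wl=532e-9, dp=8e-6):
    # ASM propagation over distance z (assumed wavelength and pixel pitch).
    ny, nx = u.shape
    fx, fy = np.meshgrid(np.fft.fftfreq(nx, dp), np.fft.fftfreq(ny, dp))
    arg = 1.0 - (wl * fx) ** 2 - (wl * fy) ** 2
    H = np.exp(1j * (2 * np.pi / wl) * z * np.sqrt(np.maximum(arg, 0))) * (arg > 0)
    return np.fft.ifft2(np.fft.fft2(u) * H)

def l2rm_step(u_prev, I_n, M_n, I_next, M_next, dz):
    """One L2RM iteration: occlude the previous field with M_n, add slice n,
    propagate forward by dz, then replace the pixels occluded by slice n+1
    (where M_next = 0) with that slice's amplitude, so the next slice's
    boundary diffracts as a continuous field rather than a hard edge."""
    u = propagate(np.sqrt(I_n) + M_n * u_prev, dz)
    return M_next * u + (1.0 - M_next) * np.sqrt(I_next)
```

Compared with the standard method, the only change is the final replacement term, which fills in the otherwise-missing information behind the next slice.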

Experimental Results and Discussion
Since the SLM used can only modulate monochromatic light, the grayscale values of the dataset were used for training and testing the network in the experiments. With sufficiently capable hardware, it is feasible to generate holograms separately for each color channel of the dataset and reconstruct a color image accordingly. As with 2D reconstructed images, we calculate the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) by comparing the intensity image of the 3D scene with the recombination of the reconstructed diffraction fields. Since the forward-backward-diffraction framework and the FCN are the fundamental methods employed in this paper, the "standard method" discussed in the experiments specifically refers to "forward-backward-diffraction framework + FCN + standard method," while "L2RM" denotes "forward-backward-diffraction framework + FCN + L2RM." We generated multi-depth holograms of the testing datasets and evaluated their reconstruction quality, as illustrated in Figure 5A. The average intensity PSNR of the reconstructed scenes using L2RM is 31.8 dB, with an SSIM of 0.86, which is 4.0 dB and 0.11 higher, respectively, than the standard method. As shown in Figure 5B, the average processing times for generating a multi-depth diffraction field from a 3D scene and for generating a multi-depth hologram using the FCN are 54 and 36 ms, respectively. Therefore, on a single-GPU computing platform, it takes ≈90 ms to generate a multi-depth hologram from a 3D scene. To achieve faster processing, multiple GPUs can form a parallel computing system in which each GPU handles the computation of the diffraction field for a single layer (≈9 ms). Operating in a pipelined manner, this parallel setup can enhance the efficiency of large-scale tasks, such as generating 3D holographic videos, theoretically approaching 36 ms per frame.
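For reference, the PSNR figures quoted above follow the standard definition; a minimal sketch, assuming images normalized to a peak value of 1:

```python
import numpy as np

def psnr(target, recon, peak=1.0):
    """Peak signal-to-noise ratio (in dB) between the target intensity image
    and the recombined reconstruction, both scaled to [0, peak]."""
    mse = np.mean((target - recon) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

Library implementations such as scikit-image's `peak_signal_noise_ratio` compute the same quantity and are a practical choice in an evaluation pipeline.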
We reconstructed the "poker" scene shown in Figure 3B using both the standard method and the proposed L2RM, with the results presented in Figure 6.
Figure 6B shows the numerical reconstructions of the two methods at different layers. It is obvious that when implementing occlusion culling in the diffraction calculation, L2RM effectively alleviates the diffraction missing phenomenon of the standard method and enhances the reconstruction quality. As the counterpart of the numerical reconstruction, Figure 6C shows the optical experimental results for the standard method and the proposed L2RM. The experimental results provide the same validation as the numerical ones, indicating that the proposed L2RM implements occlusion culling well.
To verify the effectiveness of the proposed method for complex scenes, we constructed and sampled a 3D scene depicting a living room. The 3D scene is shown in Figure 7A; the main objects of interest are the tree, clock, airplane, guitar, bear, and bird. The corresponding depth information is divided into six sections, as depicted in Figure 7B. After calculating the multi-depth diffraction field using the proposed L2RM and inputting it into the trained network model, we obtained the multi-depth hologram of this 3D scene, as shown in Figure 7C. The phase of the generated hologram is smooth at all depths, indicating that the diffraction orders can be separated with a simple operation. Additionally, the smooth variation of the phase suggests that the reconstructed scene will exhibit reduced speckle noise.
To validate the effectiveness of L2RM, we choose WH, [16] based on iterative optimization, and DPH, [17,73] utilizing heuristic coding approximation, as comparison techniques. The former involves 200 iterations, with a hologram generation speed of 6.5 s per frame and an average reconstructed-image PSNR of 23.5 dB. The latter achieves a hologram generation speed of 60 ms per frame and an average reconstructed-image PSNR of 28.6 dB. Numerical and optical reconstructions are independently generated for the complex scene illustrated in Figure 7 using WH, DPH, and L2RM; Figure 8 depicts these experimental outcomes. Our analysis focuses on the "football-guitar" pair and the "airplane-dog" pair to contrast the defocus blur and image clarity among the reconstructions produced by the three methods. "Football" and "airplane" are positioned in the foreground, while "guitar" and "dog" are situated in the background.
Since each run of the WH algorithm optimizes for just one depth plane, 3D WH is composed by overlaying multiple 2D WH optimizations tailored to different depth planes. While the reconstruction quality of 2D WH is typically superior within the same group, this principle does not extend to 3D holography. In defocused scenarios, 2D WH exhibits a significant amount of speckle noise. [16] Consequently, as shown in Figure 8A, the reconstructed images of 3D WH display speckle noise proportional to the number of depth planes. Furthermore, 3D WH also lacks support for smooth depth variations.
The principle of DPH involves representing any complex function as the sum of two unit complex functions. As with 3D WH, we superimpose multiple depth planes of DPH to create a 3D DPH. A significant limitation of DPH is that the region with a high signal-to-noise ratio in the reconstruction plane is very narrow along the orientation of the DPH macropixels. [74] Furthermore, the complex-amplitude decomposition of DPH diminishes the available space-bandwidth product, resulting in the loss of high-frequency details and causing ringing artifacts in the reconstructed image. As illustrated in Figure 8B, regions with abrupt grayscale changes, such as the transition between dark and light areas of the "football," the edges of the "guitar," and the tail of the "airplane," exhibit prominent edge artifacts attributable to the absence of high-frequency components.
The reconstructed image depicted in Figure 8C is generated using the method proposed in this paper.It is evident that, in comparison to WH, the reconstructed image from the proposed method exhibits minimal speckle noise.Similarly, when contrasted with DPH, there are also a few edge artifacts present in the reconstructed image.WH can only iteratively optimize the reconstructed image of a single depth plane, whereas FCN can simultaneously optimize the reconstructed images of multiple depth planes through a multi-depth loss function.As a result, the proposed method can achieve seamless depth transformation without substantial speckle-noise during defocusing.During the DPH generation process, two-phase matrices, separated from the diffraction field, intersect using a "checkerboard" encoding method to create a phase matrix (phase-only hologram). [74]his decomposition and combination approach results in the loss of half of the phase information, leading to the high-frequency components losing their carriers and consequently causing artifacts to appear in the reconstructed image.The primary method by which FCN processes diffraction fields is through convolution operations.While convolving the diffraction field, the dot product operation of the convolution kernel connects the real and imaginary parts of the diffraction field, while the sliding operation links neighboring elements within the diffraction field. 
[75] After a finite number of convolutions and transposed convolutions, [76] the receptive field of the generated hologram surpasses the diffraction range corresponding to the diffraction angle. We consider this the foundation that allows neural networks to reconstruct diffraction fields. Building on this foundation, by training and optimizing the FCN, it becomes feasible to retain high-frequency components to a significant extent. Consequently, compared with DPH methods that struggle to retain high-frequency components, the proposed method yields reconstructed images with minimal edge artifacts.
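The double-phase checkerboard encoding discussed above can be sketched as follows. This is a minimal illustration of the standard decomposition A·e^{iφ} = (e^{i(φ+δ)} + e^{i(φ−δ)})/2 with δ = arccos(A), not the exact implementation from ref. [74]; the normalization step and the function name are our own.

```python
import numpy as np

def double_phase_encode(field):
    """Checkerboard double-phase encoding of a complex field (sketch).

    Decomposes A*exp(i*phi) into two unit-amplitude phases and
    interleaves them on a checkerboard grid to form a phase-only
    hologram.
    """
    amp = np.abs(field)
    amp = amp / amp.max()              # normalize so arccos is defined
    phs = np.angle(field)
    delta = np.arccos(amp)             # A*e^{ip} = (e^{i(p+d)} + e^{i(p-d)})/2
    theta1 = phs + delta
    theta2 = phs - delta
    h, w = field.shape
    mask = (np.indices((h, w)).sum(axis=0) % 2).astype(bool)  # checkerboard
    return np.where(mask, theta1, theta2)  # phase-only hologram
```

Because each macropixel carries only one of the two phase samples, half of the decomposed phase information is discarded, which is precisely the loss mechanism described above.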
Furthermore, we focus the camera on the depth planes at 10, 15, 20, 25, 30, and 35 mm in turn. As depicted in Figure 9, we capture the optical reconstructions of the sub-objects at each depth, including the tree, clock, airplane, guitar, bear, and bird. All objects of interest exhibit clear reconstructions at their corresponding depths while appearing blurry at incorrect depths. Additionally, the optical reconstructions of the multi-depth hologram show minimal speckle noise. Edge artifacts manifest only around objects that are out of focus, and they are significantly reduced when the objects are properly focused.

Conclusion
In this paper, a random 3D training data set, a fully convolutional neural network, and a multi-depth loss function are introduced for generating phase-only holograms from the multi-depth diffraction field of a complex scene. Furthermore, the L2RM is introduced to implement occlusion culling and to smooth the boundaries between different layers during the reconstruction process. Numerical and optical experiments demonstrate that the reconstructed scene exhibits high display quality and excellent 3D depth focusing.
In the next stage of our research, we plan to decrease the depth interval incrementally until an effectively continuous 3D reconstruction is reached. However, when the depth interval is reduced, the depth focusing may not be apparent unless additional depth layers are added, yet adding more depth layers may cause objects to be segmented improperly and lead to more boundary artifacts in the reconstructed image. Meanwhile, a large number of depth layers will greatly increase the computational cost of generating the multi-depth diffraction fields. To address these issues, we are considering introducing spherical waves and utilizing convolution methods to generate the diffraction field instead of the ASM.
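For reference, the ASM propagation that the multi-depth diffraction computation relies on can be sketched as below. This is a minimal textbook implementation (band-limiting and sampling safeguards are omitted), with the wavelength and pixel pitch taken from the experimental setup; the function name is our own.

```python
import numpy as np

def asm_propagate(u0, z, wavelength=638e-9, pitch=3.74e-6):
    """Propagate a complex field u0 by distance z with the angular
    spectrum method (ASM). Evanescent components are cut off."""
    ny, nx = u0.shape
    fx = np.fft.fftfreq(nx, d=pitch)       # spatial frequencies
    fy = np.fft.fftfreq(ny, d=pitch)
    FX, FY = np.meshgrid(fx, fy)
    arg = 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    kz = 2 * np.pi / wavelength * np.sqrt(np.maximum(arg, 0.0))
    H = np.exp(1j * kz * z) * (arg > 0)    # transfer function
    return np.fft.ifft2(np.fft.fft2(u0) * H)
```

A negative z propagates the field backward, which is how a forward-backward diffraction framework can reuse the same routine for both directions.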
We also intend to achieve full-color reconstruction of 3D scenes by introducing a three-color laser through time-division multiplexing. Certainly, aligning the diffraction patterns of different monochromatic light at the micron level is still a significant challenge. In addition, the generated model contains a large number of parameters, resulting in significant memory consumption and placing high demands on hardware devices, especially graphics processing units. On regular computing platforms, the generation speed of multi-depth holograms will be significantly reduced, or the model may even fail to load due to insufficient graphics memory. Therefore, we plan to optimize the network structure by reducing the number of model parameters, using lower-precision parameters, compressing the model, and other methods to decrease memory usage and facilitate the fast generation of multi-depth holograms.

Experimental Section
The numerical platform was based on Python 3.8.13, PyTorch version 1.11.0, and CUDA version 11.6. The designed FCN model was trained and tested on the NVIDIA Tesla A100 40G tensor core GPU. The Adam optimizer was used for optimizing the weights and biases with a learning rate of 0.0005. The model was trained for a total of 50 epochs.
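As a concrete illustration of the update rule behind this training configuration, a single Adam step with the paper's learning rate can be written out as below; β1, β2, and ε are not stated in the text and are assumed to be the common defaults.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=5e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam parameter update (sketch).

    w: parameter, g: gradient, m/v: first/second moment estimates,
    t: 1-based step count. Returns the updated (w, m, v).
    """
    m = b1 * m + (1 - b1) * g          # biased first moment
    v = b2 * v + (1 - b2) * g * g      # biased second moment
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

In practice this is handled by `torch.optim.Adam(model.parameters(), lr=5e-4)`; the sketch only makes the per-step arithmetic explicit.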
Figure 10 depicts the optical experiment setup, where a non-polarizing semiconductor laser with a wavelength of 638 (±8) nm and a power of 30 mW, together with a single-mode fiber with a core diameter of 4 μm, was used as the reconstruction light source. The relatively broad linewidth helps suppress coherent noise, improving the reconstruction quality. The output end of the fiber, which acted as a point source due to its small core diameter, was placed at the focal point of a collimating lens with a focal length of 100 mm to obtain plane waves. A neutral density filter was used as an attenuator, and a polarizer was used to obtain linearly polarized light. The polarization orientation could be rotated with a half-wave plate (HWP) to match the optimal polarization direction of the nematic twisted liquid crystal on silicon (LCoS) device, and a rectangular aperture was inserted to obtain a rectangular beam profile. The incident laser at the nematic twisted LCoS (Cas Microstar FSLM-4K70-P02), with a resolution of 4094 × 2400 and a pixel interval of 3.74 μm, was modulated and reflected, and the reconstructed scene was further enlarged using a Fourier lens with a focal length of 100 mm. A spatial filter was applied to allow the desired diffraction order to pass while the other diffraction orders were filtered out. The reconstructed, magnified 3D scene was captured using a Canon EOS 5D Mark III camera whose lens was removed, so the scene was captured by the CMOS sensor directly. The camera was positioned on a linear guide so that its location could be adjusted to capture the 3D scene at different depths.

σ_{e,t} = (1/(MN)) ∑ (I_e − μ_e)(I_t − μ_t). c_1 and c_2 are two constants, typically assigned the values 0.0001 and 0.0009, respectively. The higher the PSNR and SSIM values, the more similar the two scenes are, indicating superior reconstruction quality.
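The two metrics can be computed directly from these definitions. Below is a minimal NumPy sketch of the global (whole-image) PSNR and SSIM as defined here, rather than the windowed SSIM variant; the `peak` parameter is an assumption about the image dynamic range.

```python
import numpy as np

def psnr(I_t, I_e, peak=1.0):
    """Peak signal-to-noise ratio between ground truth I_t and
    reconstruction I_e, for images with maximum value `peak`."""
    mse = np.mean((I_e - I_t) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def ssim(I_t, I_e, c1=1e-4, c2=9e-4):
    """Global SSIM using the means, variances, and covariance
    defined in the text, with c1 = 0.0001 and c2 = 0.0009."""
    mu_t, mu_e = I_t.mean(), I_e.mean()
    var_t, var_e = I_t.var(), I_e.var()
    cov = ((I_t - mu_t) * (I_e - mu_e)).mean()
    return ((2 * mu_e * mu_t + c1) * (2 * cov + c2)) / \
           ((mu_e ** 2 + mu_t ** 2 + c1) * (var_e + var_t + c2))
```

An identical pair of images gives SSIM = 1, and a uniform 0.1 intensity error against a zero image gives a PSNR of 20 dB, matching the definitions above.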

Figure 2 .
Figure 2. The generation of multi-depth holograms by FCN. A) The computation of the multi-depth diffraction field using the forward-backward-diffraction framework. B) The structure of the FCN. C) The calculation of the multi-depth error. The complex amplitude U_i (i = 1, 2, …, 6) and the wave field √I_i (i = 1, 2, …, 6) in (A) are each depicted as a single image for simplicity, even though they actually consist of real and imaginary parts.

Figure 3 .
Figure 3. Occlusion processing. A) The 3D scene featuring "pokers." B) The sampling results. C) The sample slices and "0-1" masks. D) Processing occlusion relationships with the standard method.

Figure 4 .
Figure 4. Occlusion enhancement. Analysis of A) the standard method and B) L2RM. C) Enhancing occlusion effects with L2RM.

Figure 5 .
Figure 5. Evaluation of the generation of multi-depth holograms. A) Reconstruction quality of holograms generated by the standard method and L2RM. B) Generation time of diffraction fields and holograms with L2RM.

Figure 6 .
Figure 6. Quality comparison of reconstructions. A) Target scene. B) Numerical reconstruction of the standard method and L2RM, respectively. C) Optical reconstruction of the standard method and L2RM, respectively.

Figure 7 .
Figure 7. The complex 3D scene and the corresponding hologram. A) Intensity image and B) depth image of the 3D scene. C) The multi-depth hologram generated by the FCN.

Figure 8 .
Figure 8. The numerical reconstruction and optical reconstruction of A) WH, B) DPH, and C) L2RM. The images in rows 1, 3, and 5 represent numerical reconstruction, while rows 2, 4, and 6 depict optical reconstruction. In columns 1 and 2, the camera focuses on the front focus plane ("football") and the rear focus plane ("guitar") of the "football-guitar" pair, respectively. In columns 3 and 4, the camera focuses on the front focus plane ("airplane") and the rear focus plane ("dog") of the "airplane-dog" pair, respectively.

Figure 9 .
Figure 9. Reconstructed objects at different depth planes.
MSE is the mean-square error of the ground truth I_t and the evaluated reconstruction I_e, MSE = (1/(MN)) ∑ (I_e − I_t)², where M × N is their size. μ_e and μ_t respectively represent their mean values. σ_e² and σ_t² respectively represent their variances, σ_e² = (1/(MN)) ∑ (I_e − μ_e)² and σ_t² = (1/(MN)) ∑ (I_t − μ_t)². σ_{e,t} is the covariance between them, σ_{e,t} = (1/(MN)) ∑ (I_e − μ_e)(I_t − μ_t).