The Intelligent Design of Silicon Photonic Devices

Photonic devices based on silicon waveguides are essential to versatile high-performance and low-cost photonic integrated systems. Extremely complex silicon photonic devices with hundreds or even thousands of degrees of freedom (DOF) have been successfully designed and manufactured based on recent advances in data science and nanofabrication technology. At this level, conventional forward reasoning may no longer be suitable for designing high-performance silicon photonic devices with novel functionalities, since the light-matter interaction is complex and non-intuitive. Therefore, the timely development of sub-wavelength silicon photonic devices that can precisely mold the flow of light is a critical and urgent issue requiring joint engineering and scientific efforts. In this paper, an inverse design strategy based on heuristic and gradient-descent algorithms, enabling the realization of large-scale integrated devices, is first introduced. Subsequently, the burgeoning deep learning technology, which offers a promising direction for the automated design of silicon photonics with a data-driven approach, is discussed. Finally, the obstacles and prospects of this emerging research direction are revealed. Detailed discussions from multiple perspectives are provided. This review aims to provide general guidance and a comprehensive reference for scientists developing photonic integrated systems.


Introduction
Silicon photonics focuses on the next generation of semiconductors developed on silicon-based materials and CMOS process platforms, with which it is compatible. Devices are typically fabricated on a silicon-on-insulator (SOI) wafer, whose conventional monocrystalline silicon layer is endowed with a designed pattern via etching. The top can be coated with silicon dioxide or exposed to air.
Theoretically, silicon photonic devices can be designed in any shape to control light precisely. However, the practical implementation thereof is very challenging. Conventional methods for preparing silicon photonic devices rely on forward design, which is based on intuitive and empirical physical approaches: the prototype structure is obtained based on prior knowledge, and the optical properties are calculated by mathematical analysis. [11,12] Then, iterative simulations are conducted by finely tuning parameters to approach the target response. [13,14] This method is suitable for simple photonic devices with a few optimization parameters, and it was previously applied to design photonic devices manually. Some commonly used templates for that purpose, e.g., Y-branch splitters, [15] Mach-Zehnder interferometers, [16] ring resonators, [17,18] and Bragg gratings, [19] provide many initial structures and design experiences for subsequent research. However, this also has disadvantages, since forward design heavily relies on a limited number of templates and physical experience, hindering the implementation of devices with novel structures and functions. An empirical optimization process without scientific guidance is a dead end, and iterative computing consumes vast resources, enabling the optimization of only a limited number of design parameters and hardly reaching high-performance devices. Besides, the intuitive method is subjective and influenced by the designer's personal preferences, biases, and knowledge limitations, making it hard to adapt to public demand and diversified application scenarios. Evidently, the intuitive design approach is inadequate for silicon photonics, which evolves toward wide bandwidth, high efficiency, and large-scale integration.
Thus, the exploration of target-oriented inverse design methods is critical. In this review, we mainly focus on two research approaches for the inverse design of silicon photonic devices, i.e., optimization strategies and deep learning. i) Optimization strategies are based on electromagnetic numerical calculation and optimization algorithms. Generally, electromagnetic simulation tools can solve design problems by discretizing Maxwell's equations in the frequency and time domains. The optical characteristics of a given structure can be accurately calculated by setting up enough meshes and iterative steps. An optimization algorithm is a framework for guiding the inverse design process. For complex devices with non-intuitive shapes and high-dimensional problems, blind simulation iteration is very resource-consuming, and it is necessary to provide an orientation for the numerical computation to successfully find an excellent design in a huge design space. Figure 2 sketches the hierarchical framework of the advanced algorithms in this review. Among these, optimization algorithms consist of three categories: the first is the simplest, the direct binary search (DBS) algorithm, which is a brute-force search strategy; the second is the heuristic algorithms with global search capabilities, like the genetic algorithm (GA) and particle swarm optimization (PSO); the third is gradient optimization strategies, such as the adjoint (ADJ) method, level set (LST) algorithm, and density topology optimization (DTO) approach, which enable the realization of much more sophisticated devices with a much higher number of design DOF. For example, by utilizing the genetic algorithm, Yu et al. [20] designed a polarization converter with a footprint of 0.96 μm × 4.2 μm. This device achieved a minimum loss of 2 dB in the experiment, which is impossible by the conventional intuitive design method. Further, Section 2 presents a new perspective demonstrating how optimization algorithms address silicon photonic inverse design, from the underlying algorithm principles to their exciting research achievements in silicon photonics. ii) In parallel, deep learning is an efficient way to capture the complex mapping between the design and response spaces. Such a data-driven approach boosts silicon photonics research in two directions: forward prediction and inverse design. Forward networks can serve as an effective surrogate model to replace Maxwell's simulator, predicting the optical response of photonic devices. Unlike intuitive design and optimization methods, inverse networks can solve inverse design problems, aiming to retrieve the optimal design from a given target response.
With the assistance of deep learning, Yuan et al. developed ultracompact wave decomposition multiplexers, achieving −2 dB insertion loss, a low reflection of −10 dB, and a low crosstalk of −7 dB in their experiments. [21] Remarkably, for devices with many DOF, the size of the search space explodes exponentially with the design dimensionality (the so-called "curse of dimensionality"). Deep learning is, at least for the moment, the best choice for tackling this issue. Section 3 gives the background of deep learning technologies and then discusses recent advances in silicon photonics facilitated by deep learning, focusing on discriminative models (like the fully connected neural network, FCNN) and generative models (such as the variational autoencoder, VAE, and the generative adversarial network, GAN), used for predicting the optical response and for designing devices according to desired optical properties, respectively.

Optimization Strategies for Silicon Photonics
In photonics, dealing with higher design dimensions means more precise light manipulation. However, acquiring an optimal set of parameters in a high-dimensional design space within a limited time becomes virtually impossible as the number of design parameters increases, so the focus is on finding a suboptimal or satisfactory solution. This section will introduce a series of intelligent optimization algorithms that enable designers to overcome the limitation of intuitive physical representations, searching for excellent designs to efficiently manipulate and control light.

Figure 1. A dial illustration of the silicon photonic inverse design methodology. The colored middle ring indicates the five major categories of inverse design methods. DBS, heuristic, and gradient algorithms are classical optimization algorithms, and discriminative and generative models are emerging deep-learning design approaches. The outer sector lists a series of specific algorithms or models. The inner sector depicts the high-performance silicon photonic devices developed via inverse design and the corresponding optical field modulation plots or spectral response characteristic curves. Reproduced with permission. [195] Copyright 2017, The Optical Society. Reproduced with permission. [44] Copyright 2020, The Optical Society. Reproduced with permission. [86] Copyright 2017, Springer Nature. Reproduced with permission. [21] Copyright 2022, The Optical Society. Reproduced with permission. [179] Copyright 2021, American Chemical Society.

Direct Binary Search (DBS) Method
The DBS algorithm is a primitive optimization method; owing to its strong extensibility and simplicity of implementation, it has received extensive attention from photonic researchers. Recently, a series of high-performance and compact photonic devices have been demonstrated using the DBS algorithm. One such device is a polarization beam splitter; after a general explanation of the functioning principle of the DBS method, we show its optimization according to the DBS method (Figure 3). To clearly demonstrate the optimization procedure, a universal method that describes the device structure needs to be defined. The device is approximately discretized into M × N cells, called "pixels" (Figure 3a). Each pixel can be etched or unetched and is represented by the binarized parameter '0' or '1', so that a set of binary arrays can describe the device pattern. The DBS method optimization flowchart is shown in Figure 3b. During optimization, one of the pixels is selected, and its current state is changed (if it is '0', flip it to '1', and vice versa). The current operation is kept if the target function is improved; otherwise, it is undone. The same operation is performed for the next pixel until all pixels are scanned, representing one iteration cycle. The next iteration is continued until the expectation is met or the objective function cannot be raised. Utilizing the DBS method, Shen et al. optimized a polarization beam splitter with a size of only 2.4 μm × 2.4 μm, [22] as shown in Figure 3c. The device is fabricated on the SOI platform, having one input and two output arms. The main body is a QR-code-like structure with 20 × 20 pixels, and each pixel has a square shape with a side length of 120 nm. The role of the device is to separate unpolarized light into transverse electric (TE) and transverse magnetic (TM) modes, and the figure of merit (FOM) is defined as the average transmission efficiency of the two modes.

Figure 3. Polarization beam splitter design by the DBS algorithm. [22] a) Device discretization into M × N cells, called "pixels", showing a single-pixel material replacement process; b) DBS method optimization flowchart; c) Schematic of the device geometry; d) Comparison of simulated and experimental results. Reproduced with permission. [22] Copyright 2015, Springer Nature.
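The pixel-scan loop of Figure 3b can be condensed into a short sketch. In the minimal Python sketch below, `simulate_fom` is a hypothetical placeholder wrapping a full EM simulation (e.g., FDTD) that returns the FOM of a binary pattern; all names and sizes are illustrative assumptions, not the authors' actual code.

```python
import numpy as np

def direct_binary_search(simulate_fom, m=20, n=20, max_iters=10, rng=None):
    """Minimal DBS sketch: flip one pixel at a time, keep improvements.

    simulate_fom: callable mapping an (m, n) binary array to a scalar FOM
                  (assumed to wrap a full EM simulation in practice).
    """
    rng = np.random.default_rng() if rng is None else rng
    pattern = rng.integers(0, 2, size=(m, n))      # random initial QR-code-like layout
    best_fom = simulate_fom(pattern)
    for _ in range(max_iters):                     # one iteration = one full pixel scan
        improved = False
        for i in range(m):
            for j in range(n):
                pattern[i, j] ^= 1                 # toggle etched/unetched state
                fom = simulate_fom(pattern)
                if fom > best_fom:                 # keep the flip only if FOM improves
                    best_fom = fom
                    improved = True
                else:
                    pattern[i, j] ^= 1             # otherwise undo it
        if not improved:                           # converged: no single flip helps
            break
    return pattern, best_fom
```

Note that each full scan costs M × N EM simulations, which foreshadows the computational burden discussed below.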
After ≈140 h of iterating the QR-code pattern, the DBS algorithm converged. Figure 3d shows the simulated and experimental transmittance at 1.450-1.650 μm, demonstrating excellent agreement. Figure 4 illustrates other examples of the application of DBS algorithms to the inverse design of massively compact silicon photonic devices: bimodal interferometer, [23] mode transformer, [24,25] mode-division multiplexing (MDM), [26,27] polarization beam splitter, [28] power splitter, [29,30] and 3 dB coupler. [31] Although DBS-based algorithms have shown huge potential for introducing inverse strategies into the photonic design area, such a coarse and brutal prototype method also exhibits some significant weaknesses. Two major issues emerged as the research proceeded. The first problem is the poor capability of the DBS method in searching for the global optimum. As the algorithm essentially performs local optimization, it is sensitive to initial conditions and very likely to converge to an under-optimized local minimum. The second problem concerns the abusive usage of computational resources. The brute-force manner that the DBS method adopts is computationally expensive and time-consuming: it requires searching each pixel of the design space, which directly increases the number of electromagnetic (EM) simulations demanded, resulting in an exhausting design process that is unsuitable for devices with a large size or number of pixels. In the following subsections, two advanced classes of algorithms, namely heuristic optimization algorithms and gradient-based optimization algorithms, are introduced to address the abovementioned issues.
Genetic Algorithm (GA)
The genetic algorithm (GA), inspired by natural selection, is a widely used heuristic method for designing photonic devices. The pattern is first encoded in a 1D array, representing a set of design parameters analogous to a chromosome. Afterward, an initial population is generated, and fitness values are calculated. New populations are generated by mimicking genetic selection, crossover, and mutation operations, as demonstrated in Figure 5a-c, respectively. Selection follows the roulette-wheel rule, where the probability of selecting an individual is proportional to its fitness value. The crossover operation selects two chromosomes from the current population to cross and reproduce the offspring. In this step, the new chromosome is expected to combine the respective advantages of its parents. The mutation operation flips or changes the values at random positions of the selected mutant individuals. Mutation creates new genes, maintains the diversity of the population, and helps escape local optima. The selection, crossover, and mutation operations are repeated until the termination condition is met.
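A compact sketch of this loop is shown below, assuming a user-supplied `fitness` function that evaluates a binary chromosome (in practice, via an EM simulation); the population size, crossover rate, and mutation rate are illustrative assumptions.

```python
import numpy as np

def genetic_algorithm(fitness, n_genes, pop_size=32, n_gens=100,
                      crossover_rate=0.8, mutation_rate=0.02, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    pop = rng.integers(0, 2, size=(pop_size, n_genes))        # initial population
    for _ in range(n_gens):
        fit = np.array([fitness(ind) for ind in pop])
        probs = fit - fit.min() + 1e-12                       # roulette-wheel selection:
        probs /= probs.sum()                                  # probability ∝ fitness
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
        children = parents.copy()
        for k in range(0, pop_size - 1, 2):                   # single-point crossover
            if rng.random() < crossover_rate:
                cut = rng.integers(1, n_genes)
                children[k, cut:], children[k + 1, cut:] = (
                    parents[k + 1, cut:].copy(), parents[k, cut:].copy())
        mask = rng.random(children.shape) < mutation_rate     # bit-flip mutation
        children[mask] ^= 1
        pop = children
    fit = np.array([fitness(ind) for ind in pop])
    return pop[fit.argmax()], fit.max()
```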
GA is a sensible choice for global optimization-seeking problems and is widely used in silicon photonics design. Yu et al. [35] used GA to design and fabricate an ultra-compact reflector with a footprint of only 2.16 μm × 2.16 μm (Figure 5d). After ≈48 h of GA optimization, the simulated reflectivity and the 1-dB bandwidth reached 97% and 220 nm, respectively. It was critical to consider that the final device structure may severely deviate from the design layout due to the corner-rounding effect in the fabrication step. Thus, the authors simulated the effect of different rounding radii at the corners on the reflectance spectra, and the results indicated that the effect of the corners could be neglected. Experimental measurements demonstrated a reflectivity of 85% over a bandwidth of 1.440 to 1.640 μm, with the highest reflectivity exceeding 95%. Moreover, a series of Fabry-Perot (FP) cavities operating at different wavelengths was designed by integrating the reflector with a waveguide. The experimental results showed that these FP cavities exhibit intrinsic quality factors exceeding 2000 within a spectral width of 200 nm (Figure 5e).

Figure 5. GA-optimized on-chip broadband ultra-compact reflector and Fabry-Perot (FP) cavity. [35] a) The flowchart of GA operation; b) Crossover operation demonstration; c) Mutation operation demonstration; d) Upper panel: an optical image of the FP cavity; lower panel, left: the SEM pattern of the grating coupler section; lower panel, right: the SEM pattern of the reflector section; [35] e) Upper panel: experimentally measured normalized transmission spectra of FP cavities, consisting of the spectra of four devices at wavelengths ranging from 1.440 to 1.640 μm; lower panels, left and right: the optical resonance spectra near 1.443 and 1.620 μm, respectively. [35] Reproduced with permission. [35] Copyright 2017, Chinese Laser Press.
As a general and robust design method, GA has been used to optimize a series of silicon photonic devices with low loss, wide band, and high integration, as shown in Figure 6. Liu and colleagues designed and experimentally demonstrated ultrasmall broadband wavelength routers. [36] Ren et al. proposed a highly efficient inverse design method based on GA and neural networks [37] (neural networks will be described in detail in Section 3). They designed power splitters with unusual splitting ratios, a TE mode converter, and a broadband power splitter with a bandwidth of 400 nm. Finally, various functional devices were demonstrated using GA optimization: reflectarray metasurface, [38] grating coupler, [39-43] waveguide filter, [44] ring-resonator filter, [45] orbital angular momentum (OAM) emitter, [46] plasmonic nanoarray, [47] and polarization beam splitter. [48]

Particle Swarm Optimization (PSO)
The main challenge of GA is that it converges slowly in high-dimensional design scenarios. PSO, by contrast, converges faster and more accurately. [49] PSO has been considered in academic and engineering fields, demonstrating its superiority in numerical optimization problems. It is also an evolutionary algorithm that starts from stochastic solutions and seeks a global optimum by following the current optimal value. As shown in the flowchart in Figure 7a, the system first generates a particle population with positions $p_1^0, p_2^0, \ldots, p_n^0$ and velocities $v_1^0, v_2^0, \ldots, v_n^0$, respectively. Afterward, by evaluating the fitness of each particle, the particle and swarm optimal solutions can be obtained. The position and velocity of a particle are updated as follows:

$$v_n^{t+1} = w_1 v_n^t + r_1 \rho_1 \left(\mathrm{pbest}_n^t - p_n^t\right) + r_2 \rho_2 \left(\mathrm{gbest}_n^t - p_n^t\right) \quad (1)$$

$$p_n^{t+1} = p_n^t + v_n^{t+1} \quad (2)$$

In Equation (1), the particle velocity at the next moment consists of three components: a previous momentum term, a cognitive term, and a social term, where $w_1$ is the inertial weight, $r_1$ and $r_2$ are the cognitive and social rates, respectively, $\rho_1$ and $\rho_2$ are random coefficients with a uniform distribution between 0 and 1, and $\mathrm{pbest}_n^t$, $\mathrm{gbest}_n^t$, and $p_n^t$ denote the best position of the particle, the best position of the swarm, and the current particle position, respectively. The position at the next moment is obtained by directly combining the current position and the updated velocity, as in Equation (2). Finally, the particle information is updated, and the iteration process is repeated until the termination condition is met. More visual details are given in Figure 7b.
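A minimal sketch of the update rules in Equations (1) and (2) is given below, assuming a user-supplied `fitness` function over continuous design parameters (e.g., the rectangle widths of the device in Figure 7c); all hyper-parameter values are illustrative assumptions.

```python
import numpy as np

def particle_swarm(fitness, dim, n_particles=30, n_iters=200,
                   w1=0.7, r1=1.5, r2=1.5, bounds=(0.0, 1.0), rng=None):
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = bounds
    p = rng.uniform(lo, hi, size=(n_particles, dim))          # positions
    v = np.zeros((n_particles, dim))                          # velocities
    pbest, pbest_fit = p.copy(), np.array([fitness(x) for x in p])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(n_iters):
        rho1, rho2 = rng.random((2, n_particles, dim))        # uniform coefficients in [0, 1]
        v = (w1 * v + r1 * rho1 * (pbest - p)                 # momentum + cognitive term
             + r2 * rho2 * (gbest - p))                       # + social term, Equation (1)
        p = np.clip(p + v, lo, hi)                            # position update, Equation (2)
        fit = np.array([fitness(x) for x in p])
        better = fit > pbest_fit
        pbest[better], pbest_fit[better] = p[better], fit[better]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest, pbest_fit.max()
```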
Utilizing the PSO algorithm, a 1 × 4 optical power splitter was designed (Figure 7c). [50] The power splitter has one input port and four output ports, and the design region comprises tens of rectangular columns; each rectangular column has the same length but a different width. The width of each column was optimized by the PSO algorithm, enabling the device to reduce insertion loss while maintaining port uniformity. The optimized design resulted in the input light being evenly distributed among the four output channels over the 1555-1570 nm band. Figure 7d,e illustrates the great agreement between simulated and experimental spectra. The measured insertion loss and uniformity are less than 0.76 and 0.84 dB, respectively, indicating the excellent performance of the device. Moreover, based on the proposed structure and inverse design method, this work demonstrates extension potential to splitters with arbitrary splitting ratios or numbers of ports and is expected to be realized on other material platforms in the future. Figure 8 illustrates the PSO algorithm applied to the design of metamaterial-type silicon photonic devices, e.g., power splitter, [15,51] waveguide crossing, [52] polarization beam splitter, [53] mode order converter, [54] polarization rotator, [55] and logic gate. [56]

Figure 7. A 1 × 4 optical power splitter optimized via the PSO algorithm. [50] a) PSO algorithm flowchart; b) Particle velocity and position update strategy; c) The structural diagram of the device; d,e) Simulation and experimental results for insertion loss and uniformity loss. Reproduced with permission. [50] Copyright 2013, The Optical Society.

Figure 8. Silicon photonic devices designed with the PSO algorithm. b) Waveguide crossing. Reproduced with permission. [52] Copyright 2013, The Optical Society; c) Polarization beam splitter. Reproduced with permission. [53] Copyright 2020, The Optical Society; d) Mode order converter. Reproduced with permission. [54] Copyright 2015, The Optical Society; e) Polarization rotator. Reproduced with permission. [55] Copyright 2014, The Optical Society; f) Logic NOT gate. Reproduced with permission. [56] Copyright 2018, IEEE; g) 2 × 2 power splitter. Reproduced with permission. [51] Copyright 2016, The Optical Society.

Adjoint Method (ADJ)
Various silicon photonic devices have been designed using heuristic algorithms, proving the effectiveness of this approach. However, heuristic algorithms are not always straightforward, especially for devices with complex geometries or functions, leading to colossal computational resource consumption and unacceptable latency. In such instances, gradient algorithms are more favorable. The critical distinction between gradient and heuristic algorithms is how they update the objective function: the former optimizes the objective function through a gradient-updating strategy, while the latter seeks better solutions through a random search. Two crucial factors should be considered to better harness the gradient method for efficient photonic design. Primarily, given that the gradient algorithm entails multiple gradient computations, which can be time-consuming, a powerful and accurate electromagnetic solver is essential; adopting gradient-based methods is now feasible thanks to improvements in current hardware platforms and software algorithms. Additionally, the objective function must satisfy the continuous differentiability condition.

The ADJ method is an efficient numerical technique for optimizing the extreme values of the FOM along the gradient orientation. [57,58] It is not limited by the number of optimization parameters: with only one forward and one adjoint simulation, the shape derivative at all points in space can be calculated to improve the value of the objective function. [59,60] To clearly demonstrate the design of photonic devices with the ADJ method, we describe a readily comprehensible instance with a series of equations. In Figure 9a, we define the FOM as $|E(x_0)|^2$, representing the energy of the electric field at a given point $x_0$. The change of the FOM due to a perturbation $\Delta E(x_0)$ can be expressed as follows:

$$\Delta \mathrm{FOM} = |E_{\mathrm{new}}(x_0)|^2 - |E_{\mathrm{old}}(x_0)|^2 \approx 2\,\mathrm{Re}\!\left[E_{\mathrm{old}}^{*}(x_0)\,\Delta E(x_0)\right] \quad (3)$$

where $E_{\mathrm{old}}$ is the electric field value before the change; when the perturbation is small enough, we consider $E_{\mathrm{new}}$ approximately equal to $E_{\mathrm{old}}$. The design region $\Omega$ is a rectangular area composed of M × N pixels, filled with different permittivities $\epsilon_r$ at various positions.

Figure 9. A 3-dB divider and a mode demultiplexer optimized through the ADJ method. [57] a) The schematic diagram of the ADJ method used to calculate the gradient; b) Left: the SEM image of the fabricated 3-dB divider; middle: the simulation results of insertion loss; right: measured excess loss profiles; c) Left: the SEM image of the fabricated mode demultiplexer; middle: the simulation results of insertion loss and crosstalk; right: experimentally measured port transmission characteristics. Reproduced with permission. [57] Copyright 2020, Chinese Laser Press.
The key issue that needs to be considered is how to adjust $\Delta E(x_0)$ to obtain a better FOM. In electromagnetism, a small change $\Delta\epsilon_r$ in the dielectric permittivity of a sufficiently small volume $\Delta V$ at $x$ can induce an electric dipole moment, which can be expressed as follows:

$$p(x) = \epsilon_0\,\Delta\epsilon_r\,\Delta V\,E_{\mathrm{old}}(x) \quad (4)$$

The electric field change at the objective position $x_0$ caused by the dipole moment at $x$ is specified as follows:

$$\Delta E(x_0) = G^{EP}(x_0, x)\,p(x) \quad (5)$$

where $G^{EP}(x_0, x)$ is Green's function, which characterizes the effect of a unit electric dipole at site $x$ on the electric field at $x_0$. Therefore, we obtain the gradient information of the FOM:

$$\Delta \mathrm{FOM} \approx 2\,\mathrm{Re}\!\left[E_{\mathrm{old}}^{*}(x_0)\,G^{EP}(x_0, x)\,\epsilon_0\,\Delta\epsilon_r\,\Delta V\,E_{\mathrm{old}}(x)\right] \quad (6)$$

So far, although each of $E_{\mathrm{old}}(x_0)$, $G^{EP}(x_0, x)$, and $E_{\mathrm{old}}(x)$ can be calculated or simulated, it would be necessary to run a simulation for each position $x$, which is not a sensible approach; it can still be improved. Applying the reciprocity theorem, $G^{EP}(x_0, x) = \left[G^{EP}(x, x_0)\right]^{T}$, Equation (6) can be rewritten as follows:

$$\Delta \mathrm{FOM} \approx 2\,\mathrm{Re}\!\left[\Delta\epsilon_r\,E_{\mathrm{adj}}(x)\cdot E_{\mathrm{old}}(x)\right] \quad (7)$$

and

$$E_{\mathrm{adj}}(x) = \epsilon_0\,\Delta V\,G^{EP}(x, x_0)\,E_{\mathrm{old}}^{*}(x_0) \quad (8)$$

where $E_{\mathrm{adj}}(x)$ is the adjoint electric field, which can be easily obtained by applying a dipole with an intensity of $\epsilon_0\,\Delta V\,E_{\mathrm{old}}^{*}(x_0)$ at $x_0$. In brief, the electric field $E_{\mathrm{old}}(x)$ over the whole design region is obtained by one forward simulation, and the adjoint electric field $E_{\mathrm{adj}}(x)$ over the whole design region is obtained by one adjoint simulation. This allows an effective calculation of the gradient at every position in the design area. Despite the simplicity of this particular case, the theory can be extended to more complex scenarios.
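The bookkeeping above can be condensed into a toy numerical experiment. The sketch below is not an EM solver; it models the discretized wave equation as a generic linear system A(ε)E = b (an assumption for illustration) and shows how one forward solve plus one adjoint solve yields the FOM gradient with respect to every pixel at once. All names and the system itself are illustrative.

```python
import numpy as np

def adjoint_gradient(A0, c, eps, b, k0):
    """Toy adjoint sketch: A(eps) = A0 - diag(c * eps), FOM = |E[k0]|^2.

    One forward solve and one adjoint solve give d(FOM)/d(eps_j) for all j.
    """
    A = A0 - np.diag(c * eps)
    E = np.linalg.solve(A, b)                      # forward simulation
    src = np.zeros_like(b)
    src[k0] = np.conj(E[k0])                       # adjoint dipole source at x0
    E_adj = np.linalg.solve(A.T, src)              # adjoint simulation
    # dA/deps_j = -c_j on the diagonal, so the per-pixel gradient is:
    return 2 * np.real(E_adj * c * E)

# Finite-difference check on a small random complex system
rng = np.random.default_rng(0)
n, k0 = 8, 3
A0 = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)) + 10 * np.eye(n)
c = rng.uniform(0.5, 1.5, size=n)
eps, b = rng.uniform(1, 2, size=n), rng.normal(size=n).astype(complex)
grad = adjoint_gradient(A0, c, eps, b, k0)

j, h = 5, 1e-6
def fom(e):
    return abs(np.linalg.solve(A0 - np.diag(c * e), b)[k0]) ** 2
fd = (fom(eps + h * np.eye(n)[j]) - fom(eps - h * np.eye(n)[j])) / (2 * h)
print(grad[j], fd)                                 # the two values should agree
```

In a real FDTD or FDFD workflow, the two `solve` calls correspond to the forward and adjoint EM simulations, and the elementwise product plays the role of Equation (7).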
The ADJ method was invented to solve control problems [61,62] and was then extensively developed for shape and topology optimization in aeronautical [63] and mechanical engineering. [64,65] Its application has broadened to silicon photonics in recent years, which significantly promotes the design efficiency of devices. Ren et al. proposed a digitized ADJ method for the inverse design of a single-mode 3-dB power splitter and a dual-mode demultiplexer. [57] The 3-dB power splitter exhibited a compact footprint of only 2.6 μm × 2.6 μm, as shown in Figure 9b. In the range of 1530-1570 nm, the experimentally exhibited average loss was 0.44 dB, with a fluctuation within 0.40 dB. The mode demultiplexer, with a design area of 2.4 μm × 3 μm (Figure 9c), achieved an average insertion loss of 1.36 dB and a crosstalk of less than −20 dB over the band from 1530 to 1570 nm in simulation. The measured results showed that the insertion loss was less than 1.51 dB and the crosstalk below −18 dB. This method could perform the optimization up to 5 times faster than a conventional DBS approach while achieving equivalent performance levels.
Conventionally, the ADJ algorithm is incompatible with digital photonic devices, since the gradient of a digital structure cannot be calculated. For this reason, the authors adopted continuous grayscale optimization, linear biasing, and brute-force quantization methods to overcome this drawback without losing the benefit of efficient computing. However, these post-processing operations, especially the thresholding, inherently introduce an approximation. Such an approximation can lead to suboptimal designs that may not fully capture the performance benefits identified during the grayscale optimization. The performance degradation can differ depending on the specific device, the initial grayscale guess, and the thresholding technique adopted. It is therefore advisable to compare the simulated performance of structures before and after post-processing; the technique deserves affirmation if the difference is negligible or the post-processed performance remains acceptable. The article cleverly employs a ternary strategy to reduce the performance degradation during digitization: it quantizes grayscale values using multi-level thresholds into various aperture sizes. The middle image in Figure 9b displays the insertion loss of the ideal grayscale structure, ranging from 0.31 to 0.33 dB. After the ternary operation, the insertion loss is slightly increased, to 0.32-0.34 dB. In this instance, the post-processing and quantization approximation have a weak effect on the performance. To further mitigate such impacts, the adjoint method can be used for an initial optimization in the grayscale space, followed by other optimization strategies in the discrete space. While these approaches are computationally intensive, they remain a viable strategy.
More recently, Qi et al. [66] proposed and experimentally demonstrated an approach based on an improved ADJ method to realize an ultrafast, ultracompact, ultralow-power-consumption integrated photonic circuit on an SOI platform. As illustrated in Figure 10, the entire system was integrated on a chip with a footprint of only 2.5 μm × 7 μm. It was composed of three parts, two all-optical switches and an XOR logic gate, each component with a size of 2 μm × 2 μm and a pitch of 1.5 μm. The optical switch lets the light pass through a disordered nanostructure when only the signal light is input; but when the signal and control light are input simultaneously, the coherent superposition of the electromagnetic fields alters the mode field distribution of the signal light, preventing it from passing through the device. The experimental results indicated that the ON/OFF contrast was greater than 6 dB, and up to 10 dB, at wavelengths between 1500 and 1600 nm. For the XOR logic gate, the output energy could only be detected when exactly one branch carried an input signal. It experimentally exhibited a '1'/'0' contrast of over 15 dB. By integrating different devices, the silicon photonic circuits implemented a logic gate controlled by the all-optical switch, with a theoretical response time of 150 fs and a threshold intensity of signal and control light within 10 fJ/bit. Besides, the circuit can also be employed to check the equality of two binary numbers.

Figure 10. The integrated photonic circuit realized through the ADJ method. [66] a) The configuration diagram of the all-optical integrated circuit; b) Simulation (left) and experimental (right) results of the optical switch transmissivity; c) Simulation (left) and experimental (right) results of the optical XOR logic gate transmissivity. Reproduced with permission. [66] Copyright 2022, Chinese Sci, Inst Optics & Electronics, Ed Off Opto-Electronic Adv.

Level Set (LST) Optimization
The ADJ algorithm is often combined with other methods, such as LST [77,78] and density topology optimization (DTO). [79] LST optimization is a numerical method of interface tracking, representing the motion of a curve at an interface in an implicit way. [80] It is particularly suitable for the topological design of photonic devices. [81] The introduction of the LST approach allows us to parameterize the design region with an implicit function and accurately track the topological variations of the boundaries. Meanwhile, it can easily constrain the minimum feature size to enhance robustness and ensure that the structure is fabricable. [82] As shown in Figure 12a, we define the material distribution in the design region through a potential function $\phi$. For design areas filled with material or air, we have:

$$\epsilon(x) = \begin{cases} \epsilon_1, & \phi(x) > 0 \\ \epsilon_2, & \phi(x) < 0 \end{cases} \quad (9)$$

where $\epsilon_1$ and $\epsilon_2$ represent the relative dielectric constants of the material and air, respectively. All material-air interfaces have the potential $\phi = 0$, implicitly represented as the zero-level set.
During the optimization process, the potential function of the design region evolves over time according to the Hamilton-Jacobi equation [83]:

$$\frac{\partial \phi}{\partial t} + v_n\,|\nabla \phi| = 0 \quad (10)$$

where $\partial\phi/\partial t$ denotes the change rate of the potential function over time, $\nabla \phi$ represents the spatial gradient of $\phi$, and $v_n$ is the normal velocity at each point on the contour. Since the variation of the curve profile is only related to the normal motion of the curve and not to the tangential motion, only the normal velocity needs to be considered in the curve evolution. Whether each point on the profile moves inward or outward determines whether the device evolves toward better performance. Here, the normal velocity $v_n$ is chosen to correspond to the gradient of the FOM function, which can be rapidly calculated by the ADJ method mentioned earlier. After a certain number of iterations, the FOM function converges to a locally optimal value. The combination of the LST method with the ADJ algorithm tends to reduce the time consumption of the inverse design process, and Hu et al. demonstrated its superiority. [70] In their work, they presented the performance of the combination of the LST and ADJ algorithms by designing and experimentally demonstrating a silicon photonic Y-junction (Figure 12b). By optimizing the profile within a 2 μm × 2 μm area, the light was maximally transmitted and equally divided between the two output ports. Initially, the device was treated as an equivalent 2D structure to reduce the computational resource requirements, and 2D FDTD was used for simulation. After 41 iterations, the problem reverted to exact 3D simulation for further iterations. It took only 51 iterations (102 simulations) to reduce the minimum insertion loss to −0.07 dB (Figure 12c,d), which was significantly better than in previous work using the PSO method, [15] where an insertion loss of −0.13 dB was achieved with 1500 simulations. The improved performance of the ADJ steepest-descent method compared to the PSO algorithm relies on taking full advantage of the underlying physics of Maxwell's equations.

Figure 11. Representative devices designed via the ADJ method. Reproduced with permission. [67,69] Copyright 2023, 2020, The Optical Society; c) Waveguide taper. Reproduced with permission. [73] Copyright 2018, The Optical Society; d) Polarization splitter rotator. Reproduced with permission. [72] Copyright 2022, IEEE; e) Metagrating. Reproduced with permission. [76] Copyright 2017, American Chemical Society; f) Four-port 3-dB coupler. Reproduced with permission. [71] Copyright 2020, IEEE; g) Metalens. Reproduced with permission. [74] Copyright 2020, The Optical Society; h) Nonlinear photonic switch. Reproduced with permission. [60] Copyright 2021, American Chemical Society.
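Numerically, the evolution of Equation (10) is usually advanced with an upwind scheme. The following is a minimal sketch under simplifying assumptions (unit grid spacing, periodic boundaries, first-order accuracy); in an actual inverse-design loop, `vn` would be the adjoint-computed FOM gradient rather than a fixed field.

```python
import numpy as np

def evolve_level_set(phi, vn, dt=0.1, n_steps=50):
    """Advance phi under phi_t + vn * |grad(phi)| = 0 (Equation (10))
    with a first-order Godunov upwind scheme on a periodic grid."""
    for _ in range(n_steps):
        dxm = phi - np.roll(phi, 1, axis=0)        # backward differences
        dxp = np.roll(phi, -1, axis=0) - phi       # forward differences
        dym = phi - np.roll(phi, 1, axis=1)
        dyp = np.roll(phi, -1, axis=1) - phi
        grad_plus = np.sqrt(np.maximum(dxm, 0)**2 + np.minimum(dxp, 0)**2
                            + np.maximum(dym, 0)**2 + np.minimum(dyp, 0)**2)
        grad_minus = np.sqrt(np.minimum(dxm, 0)**2 + np.maximum(dxp, 0)**2
                             + np.minimum(dym, 0)**2 + np.maximum(dyp, 0)**2)
        phi = phi - dt * (np.maximum(vn, 0) * grad_plus
                          + np.minimum(vn, 0) * grad_minus)
    return phi                                     # material: phi > 0; air: phi < 0

# Example: a circular silicon island (phi > 0 inside) shrinking uniformly
y, x = np.mgrid[0:64, 0:64]
phi0 = 10.0 - np.hypot(x - 32, y - 32)
phi = evolve_level_set(phi0, vn=np.full((64, 64), 0.5))
```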
Striving toward enhanced performance, optimization algorithms often produce devices with exceptionally small feature sizes, while foundry lithography is limited by the minimum area, enclosed area, and curvature. [84] Consequently, ensuring a fabricable layout is a significant issue that needs to be addressed in inverse design. [81,85] In this respect, Piggott et al. [86] proposed a general inverse design method that directly incorporates fabrication constraints. The basic algorithm of this approach is also based on a combination of the LST and adjoint methods; utilizing Equation (11), it periodically imposes a curvature constraint to avoid extremely small feature sizes:
$$v_{\mathrm{curv}}(x) = -b(\kappa)\,\kappa(x) \quad (11)$$

in which $\kappa$ is the local curvature, specified as follows:

$$\kappa = \nabla \cdot \frac{\nabla \phi}{|\nabla \phi|} \quad (12)$$

and $b(\kappa)$ is the control factor, specified as follows:

$$b(\kappa) = \begin{cases} 1, & |\kappa| > \kappa_0 \\ 0, & |\kappa| \le \kappa_0 \end{cases} \quad (13)$$

This force constraint smooths the boundaries whose local curvature is greater than $\kappa_0$ so as to satisfy the fabrication condition. In addition, morphological dilation and erosion operations were adopted to eliminate narrow gaps and bridges in the design region. To verify the feasibility of this method, a spatial mode demultiplexer (Figure 13a), a 1 × 3 splitter (Figure 13b), and a directional coupler (Figure 13c) were designed. The 1 × 3 splitter was fabricated on an SOI platform, as shown in Figure 13d. With an ultra-compact footprint of 3.8 μm × 2.5 μm, the device achieved an insertion loss of 0.642 ± 0.057 dB and a power uniformity of 0.641 ± 0.054 dB over the 1300-1700 nm band in experimental measurements, which was superior to previous reports [87,88] in both size and bandwidth. Other representative devices developed based on LST are depicted in Figure 14, including wavelength duplexers, [78,81,85,89,90] polarization converters, [81,91] power splitters, [92] quasicrystals, [93] and gratings, [94,95] suggesting the potential usefulness of inverse design for silicon platforms.

Figure 12. LST algorithm combined with the ADJ method for optimizing a Y-splitter. [70] a) LST representation and optimization strategy for photonic devices; b) The geometry of the optimization region; c) Coupling efficiency evolution during the optimization; d) Coupling efficiency from 1500 to 1600 nm of the final optimized structure. Reproduced with permission. [70] Copyright 2013, The Optical Society.

Density Topology Optimization (DTO)
Apart from the algorithms mentioned above, DTO [96] is also a gradient algorithm extensively used in nanophotonic inverse design. Generally, the design region is discretized into pixels, and each pixel can only be filled with a solid dielectric material or air. In DTO, the material permittivity of each pixel is expressed as a continuous value to facilitate the calculation of gradients. One of the representations [97] can be written as follows:

$$\epsilon_{ij} = \epsilon_1 + \rho_{ij}\left(\epsilon_2 - \epsilon_1\right) \quad (14)$$

where $\epsilon_1$ and $\epsilon_2$ denote the relative dielectric constants of air/SiO2 and silicon, respectively, and $\rho_{ij}$ is the density index, ranging from 0 to 1. DTO is executed in three steps, as shown in Figure 15a. First, the permittivity of each pixel is randomly initialized to a value in the continuous space between $\epsilon_1$ and $\epsilon_2$; this means that $\rho_{ij}$ takes a value between 0 and 1. Afterward, the derivatives of the FOM with respect to $\rho_{ij}$ are obtained by sensitivity analysis [98,99] to improve performance. The third step entails discretizing the optimized density parameters into binary parameters, since pixels with intermediate values cannot be fabricated: they correspond to neither material. A common way to map the design parameters to the physical structure is to iterate the optimization and apply biasing effects until the density function converges to 0 or 1.

Piggott et al. [100] developed an inverse design approach based on the ADJ, LST, and DTO algorithms and designed an on-chip wavelength demultiplexer. In the first stage of the design process, the permittivity of the optimized region was allowed to take values within a continuous range between those of silicon and air. The device structure was initially optimized using the steepest-descent method, [95,101] in which the local gradient was obtained via forward and adjoint electromagnetic simulations. In the second stage, the continuous-valued permittivity was converted into binary parameters by applying a bias. The LST function then represented the entire optimization region, and the optimization was continued in combination with the steepest-descent method. Finally, the objective function was extended from two to ten target frequency points. According to the literature, [100] broadband optimization is expected to reduce the impact of fabrication defects on device performance and thus improve the robustness of the design. The entire inverse design process took 36 h, and the fabricated device had an ultra-compact size of 2.8 μm × 2.8 μm; its SEM image is shown in the left panel of Figure 15b. The right panel of Figure 15b shows that the insertion losses at 1300 and 1550 nm are −1.8 and −2.4 dB, respectively, while the channel crosstalk is below −11 dB. Their work was developed for the silicon photonic platform and can be extended to the inverse design of devices for other material systems or functions.

Figure 15. On-chip wavelength demultiplexer (WDM) designed through a hybrid strategy based on the ADJ method, LST algorithm, and DTO approach. [100] a) The process used in the DTO method for designing silicon photonic devices. Left: grayscale initialization; middle: continuous density optimization; right: binary layout. b) Left: SEM images of the fabricated WDM; the middle and right panels are the S-parameters from the FDTD simulation and experimental measurement, respectively. Reproduced with permission. [100] Copyright 2015, Springer Nature.
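The relaxation of Equation (14) and the subsequent push toward a binary layout can be sketched as follows. The β-ramped tanh projection shown here is a commonly used smoothed-Heaviside biasing scheme in density-based topology optimization, given as an illustrative stand-in rather than the exact biasing used in Ref. [100]; the permittivity values approximate SiO2 and silicon near 1550 nm.

```python
import numpy as np

def project_density(rho, beta=8.0, eta=0.5):
    """Smoothed Heaviside projection: pushes a continuous density field
    toward 0/1 as beta is ramped up over the optimization."""
    num = np.tanh(beta * eta) + np.tanh(beta * (rho - eta))
    den = np.tanh(beta * eta) + np.tanh(beta * (1 - eta))
    return num / den

def density_to_permittivity(rho, eps1=2.1, eps2=12.1):
    """Linear interpolation of Equation (14): eps1 (SiO2) to eps2 (Si)."""
    return eps1 + rho * (eps2 - eps1)

# Illustrative ramp: start gray, end nearly binary. In a real DTO loop,
# gradient steps on rho would be interleaved with each increase of beta.
rho = np.random.default_rng(1).uniform(size=(20, 20))
for beta in (1, 4, 16, 64):
    rho_proj = project_density(rho, beta=beta)
eps = density_to_permittivity(rho_proj)
```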
Hammond et al. [84] proposed an automated design method based on DTO, integrating minimum linewidth, line spacing, curvature, area (for islands), and enclosed-area (for holes) design constraints. This satisfies the foundry design rule checks (DRC), as shown in Figure 16a. In addition, they analyzed the effect of under- and over-etching on robustness to predict the worst-case device performance (Figure 16b). To reduce the effects of etching error, they applied conical pattern filters in the design region (Figure 16c). The application of density topology optimization methods has also been reported for various silicon photonic designs involving metalenses, [102,103] a mode converter, [104] a wavelength demultiplexer, [105] 90° bends, [106,107] a Z-bend, [108] a polarization splitter, [109] a mode multiplexer, [110] and a wavelength router, [111] as summarized in Figure 17.

Figure 16. The inverse design of silicon photonic devices based on the DTO method and semiconductor process constraints. [84] a) DRC violation regions checked out of the broadband mirror, broadband bend, and broadband T-splitter; b) Etch pattern prediction. Left: dilated; middle: standard; right: eroded. c) A conical filter applied for geometry constraints on the minimum line width and line spacing. Reproduced with permission. [84] Copyright 2021, The Optical Society.

Figure 17. DTO-designed silicon photonic devices. a,b) Metalens. Reproduced with permission. [102,103] Copyright 2020, American Chemical Society, The Optical Society; c) Mode converter. Reproduced with permission. [104] Copyright 2013, IEEE; d) Wavelength demultiplexer. Reproduced with permission. [105] Copyright 2020, The Optical Society; e) 90° bend. Reproduced with permission. [106] Copyright 2004, AIP Publishing; f) Z-bend. Reproduced with permission. [108] Copyright 2004, The Optical Society; g) Polarization splitter. Reproduced with permission. [109] Copyright 2022, The Optical Society; h) 90° bend. Reproduced with permission. [107] Copyright 2008, The Optical Society; i) Mode multiplexer. Reproduced with permission. [110] Copyright 2016, The Optical Society; j) Wavelength router. Reproduced with permission. [111] Copyright 2021, MDPI.

Deep Learning in the Acceleration of Silicon Photonics Research
With the improvement of computing power and the advent of big data, deep learning has ushered in a wave of explosive development. Within the last decade, deep learning development platforms have become increasingly mature. Some frameworks are developed based on Python, like TensorFlow [112] for mainstream industrial applications, PyTorch [113] primarily for academia, and Keras, [114] a third-party high-level API. Another framework is deployed on MATLAB, [115] providing many deep learning models and sufficient tutorial cases. Initially, these frameworks could be built only on Linux systems, but nowadays they can also be applied on Windows systems and support GPU acceleration, dramatically shortening training time. Moreover, AI developers have enabled vast resources in an open and sharing spirit, including papers, datasets, courses, books, tutorials, and algorithmic frameworks, providing guiding ideas and realization approaches for other researchers. [116]

Deep learning has a profound influence in the realms of autonomous driving, [117] robotic control, [118] language translation, [119,120] audio recognition, [121] and image classification. [122] Recently, it has been introduced into the domain of integrated photonics as a powerful way to study the complex mapping between microstructural and response properties. Deep learning is a data-driven approach that uses deep neural networks to extract features from datasets without human involvement. In contrast to the previously described optimization-algorithm-based inverse design, which requires extensive iterative electromagnetic (EM) simulations and time consumption, deep learning is extremely efficient: once a neural network for designing photonic devices is well trained, its answers are obtained almost instantaneously. There are two main approaches for deep learning to accelerate photonic research: forward and inverse models. In the forward model, the neural network acts as a surrogate solver to predict the optical response in a short time. In the inverse model, the neural network directly outputs the required design parameters according to the target response. This section briefly introduces the principles and architectures of various deep learning models and reviews classical works that address inverse design problems using deep learning techniques.

Fully Connected Network
Among the various discriminative models, we illustrate here only the two most basic architectures, which are also the most commonly used in nanophotonic design: the fully connected network (FCN) and the convolutional neural network (CNN). Figure 18a shows an FCN consisting of an input layer, an output layer, and one or more hidden layers. Each node represents a neuron, and the neurons of adjacent layers are connected. In a forward inference process (Figure 18b), a front-layer neuron performs a two-step operation to deliver information to a back-layer neuron, as described in Equation (15):

$$a = \sum_{i} w_i x_i + b, \qquad h = f(a) \quad (15)$$

where $x_i$ and $h$ denote the values of the current-layer and back-layer neurons, $w_i$ is the connection weight between the current neuron and the back-layer neuron, $b$ is the bias term, $a$ is an intermediate variable, and $f$ is a differentiable nonlinear activation function. $w_i$ and $b$ are determined by training, while $f$ is specified by the designer. In this way, neuron information is transferred from front to back, and the prediction results are output at the end neurons.
Before explaining the training process, we need to interpret the concept of a dataset, which is a collection of many data-label pairs. In photonic systems, the design parameters of a device and its optical properties (reflection, transmission, amplitude, or phase) can be understood as a data-label pair. Datasets are divided into training and test sets. The training set is involved in the training process, while the test set is involved only in the forward prediction procedure used to evaluate the accuracy and generalization ability of the network. Since the initialization of the weight and bias parameters is random, the raw network cannot accurately capture the relationship between the inputs and outputs of the dataset, resulting in a large error between the prediction results and the true values. This error is quantified by a loss function, generally specified as the mean squared error, as in Equation (16):

$$L = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2 \quad (16)$$
where $\hat{y}_i$ and $y_i$ denote the predicted and true values of a data batch, respectively, and $N$ is the batch size. The error correction process is implemented through the chain rule of differentiation, as shown in Equation (17):

$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial h}\cdot\frac{\partial h}{\partial a}\cdot\frac{\partial a}{\partial w_i} \quad (17)$$

In this way, the partial derivatives can be computed for each weight and bias. Finally, all the parameters are updated based on the stochastic gradient descent strategy:

$$w_i \leftarrow w_i - \eta\,\frac{\partial L}{\partial w_i} \quad (18)$$

where $\eta$ is the learning rate. From the loss between the actual and target outputs and its gradient, the required modification of the weights (and biases) is calculated, and new connection weights (and biases) are obtained. The weights (and biases) are repeatedly adjusted to minimize the error, from which the optimal connection parameters are learned, as shown in Figure 18c. To improve the prediction accuracy on the test set, the user should tune various hyper-parameters according to the network performance, including the batch size, learning rate, activation function, number of layers and nodes, loss function, and others. [125-127] Some operations might be needed to avoid over-fitting, such as dropout, [128] model simplification, [129] regularization, [130] cross-validation, [131,132] batch normalization, [133] or early stopping. [134-137] These techniques for training and preventing overfitting are also compatible with the CNN architecture.

Figure 18d presents a typical application of deep neural networks in detail. Tahersima et al. [138] designed a series of beam splitters with different splitting ratios on an SOI platform. Each device possessed an ultra-compact 2.6 μm × 2.6 μm footprint with a feature size of 90 nm, compatible with current fabrication techniques. The simulation results showed that all the splitters exhibited transmission efficiencies of over 90% and a bandwidth of over 200 nm. The design region was divided into 20 × 20 pixels, and each pixel could be unetched or etched with a circular through-hole with a 45 nm radius, denoted by 0 and 1, respectively. Thus, the design pattern can be represented as a binary array of length 400, denoted as X. Another sequence expressed the corresponding output spectral characteristics, denoted as Y. Through EM simulations, they obtained a dataset of 20 000 design-parameter-response-label pairs. They constructed two network models, for forward prediction and inverse design. The forward modeling was a regression problem trained with a Gaussian log-likelihood loss function; the inverse modeling was treated as a classification problem trained with a Bernoulli log-likelihood classifier as the loss function. In particular, they utilized deep residual networks (ResNet), [139] solving the vanishing-gradient issue for neural networks with deep layers. Through the "identity shortcut connection" strategy, the forward and inverse networks were deepened to 8 hidden layers while ensuring smooth forward and backward gradient propagation, which is inaccessible for general networks. This work sheds new light on further (silicon) photonics development. Although a large dataset is initially required for training, it can be generated in parallel on a high-performance computing cluster. Once the training is completed, the network can infer the desired design parameters in less than a second, which is unimaginable for other methods.

Figure 18. In the forward model, the design parameters are regarded as input and the response properties are treated as labels, and vice versa in the inverse model. [138] Reproduced with permission. [138] Copyright 2019, Springer Nature.
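The training loop of Equations (16)-(18) maps directly onto a few lines of PyTorch. The following is a minimal sketch of a forward surrogate for pixelated layouts; the random tensors stand in for a simulated design-spectrum dataset, and all layer sizes are illustrative assumptions rather than the architecture of Ref. [138].

```python
import torch
from torch import nn

# Forward surrogate sketch: 400 binary pixels -> sampled transmission spectrum.
model = nn.Sequential(
    nn.Linear(400, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 64),                # 64 spectral sampling points (illustrative)
)
loss_fn = nn.MSELoss()                 # Equation (16)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

X = torch.randint(0, 2, (20000, 400)).float()   # stand-in for simulated layouts
Y = torch.rand(20000, 64)                       # stand-in for spectrum labels

for epoch in range(10):
    for i in range(0, len(X), 128):             # mini-batches, N = 128
        xb, yb = X[i:i + 128], Y[i:i + 128]
        loss = loss_fn(model(xb), yb)           # prediction error
        opt.zero_grad()
        loss.backward()                         # chain-rule gradients, Equation (17)
        opt.step()                              # SGD update, Equation (18)
```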
A remarkable issue worth studying is the one-to-many mapping of the inverse model, which has appeared in previous works. [140] Unlike the forward model, where a given design parameter certainly corresponds to a unique response characteristic, in the inverse model multiple device layouts may exhibit the same optical properties, as depicted in Figure 19a. The nonunique mapping makes the parameter space nonconvex, violating the gradient-based updating strategy of deep learning: the training oscillates, and the network fails to converge.
To tackle this problem, some early work from the microwave device design and research area proposed dividing the data into groups that eliminate duplicates. [141] In that work, the data was not used to train the inverse network directly. Instead, it underwent a pre-sorting process based on gradients and was segmented into multiple groups, ensuring that each group was free from the problem of multiple solutions, as presented in Figure 19b. After that, each data group was utilized to train its respective inverse sub-model. Since each inverse sub-model maintains a one-to-one mapping between its inputs and outputs, non-convergence is effectively eliminated during the training. With the aid of the well-trained forward model, the appropriate inverse sub-model is identified for different input values and assigned to accomplish the inverse design mission. In the case of inverse modeling of a spiral inductor, the test error was 13.6% for the direct modeling approach, while through such a method it was reduced to 0.05%. While these methods have been proven applicable, the device structures involved are considerably simple. They might not be feasible for photonic applications, where non-uniqueness is much more severe, as countless groups would have to be separately divided and the boundaries between different groups become vague.

Figure 19. a) Illustration of the one-to-many mapping in inverse design. Reproduced with permission. [140] Copyright 2018, American Chemical Society; b) A method that resolves the issue of multiple solutions in inverse design through partitioning of the training data. Reproduced with permission. [141] Copyright 2008, IEEE; c) A dimension-reduction design methodology for electromagnetic nanostructures based on autoencoders. Top: a forward model consisting of a cascaded pseudo-encoder and decoder, employed to map the reduced design space to the response space; [142] middle: a meta-surface with configurable reflectivity; [142] bottom: multi-layer thin-film structures. [143] Both were developed through the dimensionality-reduction technique of the autoencoder. Copyright 2020, Springer Nature; Copyright 2021, The Optical Society; d) A tandem neural network to solve the nonuniqueness issue. During the training, the weights in the forward model are fixed, while those in the inverse network are modified to minimize the error. After the training, the design parameters are extracted from the intermediate layer; [140] e) The MDN technology for dealing with the issue of one-to-many mapping. Left and middle: in the inverse engineering paradigm of the MDN, response properties are mapped to multiple mixed Gaussian distributions that contain the design information, rather than to a single set of design parameters; right: the MDN captures all degenerate solutions through a multimodal distribution, with mode strength indicated by strip opacity. Reproduced with permission. [146,147] Copyright 2020, American Chemical Society.
To alleviate the one-to-many problem, some recent research, inspired by traditional machine learning techniques, proposed introducing dimension-reduction methods. Dimension reduction is the transformation of data from a high-dimensional space into a low-dimensional space such that the low-dimensional representation retains the meaningful properties of the original data. In the ideal case, the discarded part of the information can be treated as redundant noise, and the representative vector in the lower-dimensional space still contains all the required information. Therefore, if all possible solutions of a one-to-many problem share identical properties, these features can be extracted through dimension-reduction methods, and the networks can be trained with no convergence issues. Figure 19c demonstrates an inverse design model built by training an autoencoder as a dimension-reduction agent. As the number of nodes in the bottleneck layer (encoder) decreases, the neural network tries to compress the critical information into lower-dimensional vectors. To validate the effectiveness of this method, a meta-surface with reconfigurable reflectivity, formed by a periodic array of Au nanoribbons on top of a thin layer of GST on a SiO2 substrate, was designed, [142] giving solid proof that the method performs well in dealing with non-uniqueness problems. Another inverse design of multi-layer thin-film structures was later carried out using the same strategy. [143] Despite these fruitful achievements, it has to be mentioned that dimension reduction is only an alleviation method. It suffers from limited applicability, as the reduced spaces are not guaranteed to map the response and parameter spaces in a one-to-one manner.
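The bottleneck idea can be sketched as a plain autoencoder. The sizes below and the binary-layout placeholder data are illustrative assumptions; the pseudo-encoder architecture of Ref. [142] differs in detail.

```python
import torch
from torch import nn

# Dimension-reduction sketch: compress 400-pixel layouts into an 8-dim latent
# space, where degenerate designs are more likely to share one representation.
encoder = nn.Sequential(nn.Linear(400, 64), nn.ReLU(), nn.Linear(64, 8))
decoder = nn.Sequential(nn.Linear(8, 64), nn.ReLU(),
                        nn.Linear(64, 400), nn.Sigmoid())
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

designs = torch.randint(0, 2, (10000, 400)).float()     # placeholder layouts
for epoch in range(10):
    for i in range(0, len(designs), 128):
        xb = designs[i:i + 128]
        recon = decoder(encoder(xb))                    # reconstruct via bottleneck
        loss = nn.functional.binary_cross_entropy(recon, xb)
        opt.zero_grad()
        loss.backward()
        opt.step()
# An inverse model can then target the 8-dimensional latent space instead of
# the raw 400-dimensional design space.
```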
Another proposed approach is the tandem network, as presented in Figure 19d. The forward network is first trained alone, and its weight parameters are fixed after training completion. Then, the inverse network is cascaded with the fixed-weight forward network and trained together with it. The error between the output and input optical properties is used as the loss function, and only the inverse network is updated during training. The redundant solutions of the inverse problem are eliminated under the constraint of the forward network. Such schemes are effective for inverse design problems; for instance, the latest report on inverse-designing nanophotonic geometries according to a target response was settled through the tandem framework. [140] A modified method was reported as an extension of Ref. [140]. [144] This work used data-driven methods to determine suitable initial structures, and additionally, the knowledge-dependent algorithm was employed for fine-tuning. The excellent results reveal that it offers a highly efficient and accurate solution for inverse design. In Ref. [145], a SiO2 grating surface with high diffraction efficiency and relatively broad bandwidth was inverse-designed via the tandem network. These successful projects show that the tandem network is a mature and efficient way to tackle non-unique mapping issues. Tandem networks relax the requirements on the network convergence condition, thereby alleviating the non-uniqueness issue; however, they cannot completely solve it. The strategy lacks the ability to search for the global optimum and still risks falling into oscillation, either when the complexity of the system to be predicted increases significantly or when the forward network is not set up or trained properly.
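A minimal sketch of the tandem scheme is shown below: the pretrained forward surrogate is frozen, and the inverse network is trained through it. The architectures and placeholder data are illustrative assumptions, and `forward_net` is assumed to have already been trained on (design, spectrum) pairs.

```python
import torch
from torch import nn

forward_net = nn.Sequential(nn.Linear(400, 256), nn.ReLU(), nn.Linear(256, 64))
# ... assume forward_net has already been trained as a forward surrogate ...
for p in forward_net.parameters():
    p.requires_grad = False                      # freeze the forward model

inverse_net = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                            nn.Linear(256, 400), nn.Sigmoid())
opt = torch.optim.Adam(inverse_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

spectra = torch.rand(20000, 64)                  # placeholder target responses
for epoch in range(10):
    for i in range(0, len(spectra), 128):
        yb = spectra[i:i + 128]
        design = inverse_net(yb)                 # candidate layout, relaxed to [0, 1]
        y_pred = forward_net(design)             # its response via the frozen surrogate
        loss = loss_fn(y_pred, yb)               # compare responses, not layouts: any
                                                 # layout reproducing the target is fine
        opt.zero_grad()
        loss.backward()
        opt.step()
```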
Additionally, some of the latest research suggests not directly relating the neural network's output to discrete points in the parameter space. Instead, the network predicts a probability density distribution over the parameter space when a desired response is given as input. The probability distribution can have more than one center, so the non-uniqueness issue can be handled delicately. A schematic description of such a structure is shown in Figure 19e. Rohit Unni and co-workers proposed an approach termed mixture density networks (MDNs) based on this idea. [146,147] Since the inverse network can only give a distribution over the parameter space, additional steps are required for sampling and fine-tuning the "raw" design. The training procedure can also be more complicated than those of the abovementioned strategies. Nevertheless, this novel approach shows great potential and is competitive in handling the non-uniqueness issue in neural-network-based inverse design methods.
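The core of an MDN can be sketched as follows: the network maps a target response to the weights, centers, and spreads of K Gaussian modes over the design parameters, and it is trained by maximizing the likelihood of the true designs under that mixture. The dimensions and mode count below are illustrative assumptions.

```python
import torch
from torch import nn

class MDN(nn.Module):
    """Sketch of a mixture density network: a spectrum maps to K Gaussian
    modes over the design parameters, capturing degenerate solutions."""
    def __init__(self, in_dim=64, out_dim=10, k=5):
        super().__init__()
        self.k, self.out_dim = k, out_dim
        self.body = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.pi = nn.Linear(128, k)                      # mode weights (logits)
        self.mu = nn.Linear(128, k * out_dim)            # mode centers
        self.log_sigma = nn.Linear(128, k * out_dim)     # mode spreads

    def forward(self, y):
        h = self.body(y)
        pi = torch.log_softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.k, self.out_dim)
        sigma = self.log_sigma(h).view(-1, self.k, self.out_dim).exp()
        return pi, mu, sigma

def mdn_nll(pi, mu, sigma, x):
    """Negative log-likelihood of true designs x under the mixture."""
    dist = torch.distributions.Normal(mu, sigma)
    log_prob = dist.log_prob(x.unsqueeze(1)).sum(-1)     # per-mode log density
    return -torch.logsumexp(pi + log_prob, dim=-1).mean()

net = MDN()
pi, mu, sigma = net(torch.rand(32, 64))                  # batch of target spectra
loss = mdn_nll(pi, mu, sigma, torch.rand(32, 10))        # placeholder true designs
```

At inference time, one samples from (or takes the centers of) the strongest modes and fine-tunes the resulting "raw" designs, as noted above.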
Finally, there are other ways to circumvent the one-to-many problem in inverse design. One is the hybrid optimization strategy, where the trained forward network is utilized as a surrogate predictor and an optimization algorithm is applied for the inverse design; this is still several orders of magnitude faster than conventional EM optimization methods, and a specific example can be found in Ref. [148]. Another approach is generative modeling, which is further explained in Subsection 3.2. Of course, all these schemes for tackling nonunique mappings are compatible with the CNN model introduced below.
The FCN architecture has been widely adopted to assist in designing various silicon photonic devices. Figure 20a shows an all-dielectric metasurface system designed by the FCN. [149] The neural network accurately predicted the frequency transmission characteristics and proved more than five orders of magnitude faster than conventional electromagnetic simulation software. A similar approach was applied to the aggregated grating of Figure 20b. [150] Figure 20c [151] and Figure 20d [152] present further examples of developing a grating coupler with deep learning, including forward modeling and inverse design. Lately, the FCN has been extended to a wider range of communication-relevant photonic platforms, including the Bragg grating, [153] directional coupler, [154] nanophotonic waveguide, [155] and power splitter, [156] as sketched in Figure 20e-h.

Figure 20. a,b) All-dielectric metasurface and aggregated grating. Reproduced with permission. [149,150] Copyright 2019, The Optical Society; Copyright 2018, AIP Publishing; c,d) Grating coupler. Reproduced with permission. [151,152] Copyright 2021, IEEE; Copyright 2019, IEEE; e) Bragg grating. Reproduced with permission. [153] Copyright 2019, Optical Society of America; f) Directional coupler. Reproduced with permission. [154] Copyright 2019, The Optical Society; g) Nanophotonic waveguide. Reproduced with permission. [155] Copyright 2023, The Optical Society; h) Power splitter. Reproduced with permission. [156] Copyright 2017, Chinese Laser Press.

Convolutional Neural Network (CNN)
Although the FCN offers an efficient tool for many photonic tasks, it is not suitable for modeling structures with high degrees of freedom. On the one hand, the dense full connection requires numerous weight parameters, making the network hard to train. On the other hand, 2D/3D photonic devices must be reshaped into 1D input vectors to be compatible with the FCN, breaking the correlation in the vertical direction; the FCN cannot capture the spatial features of the data, leading to occasional performance deviations. [157] Fortunately, these difficulties are overcome in the framework of CNNs. Analogous to the FCN, the CNN consists of an input layer, one or more hidden layers, and an output layer, [158] as shown in Figure 21. The difference is that each layer is connected to the next by a local convolution operation; convolutional weight sharing and sparse interlayer connections dramatically reduce computational consumption and enhance the generalization ability and stability of the network. As an ideal candidate for processing high-dimensional data, CNNs have been applied in silicon photonic research. Tahersima et al. [159] demonstrated a remarkable application of CNN assistance in the inverse design of a 1 × 2 power splitter, presented in Figure 22a. The design area is a 2.6 μm × 2.6 μm square, divided into 20 × 20 adjustable pixels, corresponding to a design space of 2^400 combinations. The CNN allows better extraction of image features from the pictorial power divider, and the model achieves excellent generalization in such an ample search space. In addition, their work demonstrated that introducing the ReLU activation function in the neural network helps alleviate vanishing gradients; a dropout rate of 0.4 helps relieve overfitting; and local batch
normalization before the convolution operation helps enhance the generalization capacity. They tested the performance of the CNN and the FCN separately on a random test set (see Figure 22b), and the results exhibited a correlation coefficient of 0.85 for the CNN, much higher than the 0.16 achieved by the FCN, as shown in Figure 22c.
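The sketch below illustrates the ingredients discussed above (convolutions, batch normalization before convolution, and dropout) in a CNN surrogate that maps a 20 × 20 binary pattern to a sampled transmission response; the layer sizes and output dimension are illustrative, not the architecture of Ref. [159].

import torch
import torch.nn as nn

class SplitterCNN(nn.Module):
    def __init__(self, n_out=63):              # n_out: sampled spectrum points
        super().__init__()
        self.features = nn.Sequential(
            nn.BatchNorm2d(1),                  # normalize before convolution
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(5),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.4),                    # dropout against overfitting
            nn.Linear(32 * 5 * 5, n_out),
        )

    def forward(self, pattern):                 # pattern: (B, 1, 20, 20)
        return self.regressor(self.features(pattern))

pred = SplitterCNN()(torch.randint(0, 2, (8, 1, 20, 20)).float())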
As mentioned in the introduction, deep neural networks often serve as a much less computationally expensive alternative to traditional numerical EM simulation methods, mapping the spatial relation between the refractive index distribution and the electromagnetic field. However, simply training deep learning neural networks (DLNNs) with numerically derived input-output pairs encounters two severe issues. First, the size of the dataset required to train such a network may become massive as the complexity and dimensionality of the modeled photonic system increase. Second, no physical laws are introduced as constraints during training; as a result, there is no guarantee that the predicted result follows the fundamental laws of physics.
To address these issues and help the learning algorithm capture the right solution even with fewer training examples, an emerging technique termed physics-informed neural networks (PINNs) was introduced, [160] as presented in Figure 23a. As most of the physical laws that govern the dynamics of a system can be described by partial differential equations (PDEs), PINNs solve the governing PDEs of physical phenomena using deep learning, in particular by replacing the traditional numerical solver with a DLNN that approximates the solution of the PDEs.
The law that governs the flow of light in a photonic system is Maxwell's equations, which are themselves a set of PDEs, so it is straightforward to introduce this method into the photonic inverse design area.

Figure 22. CNN-assisted inverse design of a power splitter. [159] a) Overview of the CNN recognition and prediction process; b) Examples of randomly generated test datasets; c) The correlation coefficient between the target and predicted transmission spectra is 0.85 for the CNN and 0.16 for the FCN. Inset: train and test data distributions. Reproduced with permission. [159] Copyright 2019, CLEO.
To obtain the approximate solution of PDEs via deep learning, a key step is to constrain the neural network to minimize the PDE residual. In this manner, prior knowledge of Maxwell's equations acts as a regularization agent that limits the space of admissible solutions. Several approaches have been proposed that incorporate PDE residuals as loss or regularization terms while training DLNNs to predict the nonlinear optical response of microlenses (see Figure 23b), [161] near-fields in periodic, high-dielectric-contrast nanostructure arrays (see Figure 23c), [162] and the scattering properties of multi-component nanoparticle structures (see Figure 23d). [163] High prediction accuracy with a small training set has been observed, indicating great potential for such methods in photonic inverse design.
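As a minimal illustration of the residual-minimization step, the PyTorch sketch below penalizes the residual of the 1D Helmholtz equation E''(x) + k^2 n(x)^2 E(x) = 0, a scalar stand-in for the full Maxwell problem; the network, wavenumber, and collocation points are illustrative assumptions.

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
k = 2.0 * torch.pi / 1.55                       # free-space wavenumber (1/um)

def pde_residual_loss(x, n_index):
    x = x.requires_grad_(True)
    E = net(x)
    dE = torch.autograd.grad(E.sum(), x, create_graph=True)[0]
    d2E = torch.autograd.grad(dE.sum(), x, create_graph=True)[0]
    residual = d2E + (k * n_index) ** 2 * E     # vanishes for a true solution
    return residual.pow(2).mean()

x = torch.linspace(0.0, 5.0, 128).unsqueeze(-1)  # collocation points
loss = pde_residual_loss(x, n_index=3.48)        # silicon-like index
loss.backward()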

Generative Model
Discriminative models have achieved significant success in silicon photonics but still have some limitations. For inverse design, it is always necessary to pre-train a forward model as a surrogate solver: either the inverse network or the optimization algorithm is built on the forward network, and the error of the forward model propagates to the next model, so the forward network must be trained very accurately. The high accuracy of discriminative models is driven by big data, and they are prone to failure in a large design space when the training data are insufficient compared with the vast search space, severely limiting their application. Generative models can alleviate the burden of dataset collection to some extent. They learn the critical features of high-performance device patterns in the training set and then mimic them to generate new designs with better optical properties. This analysis of the dataset's fundamental features yields better generalization from a limited dataset. Moreover, such models can naturally handle the one-to-many mapping in inverse design. [164,165] This subsection presents the variational autoencoder (VAE) and the generative adversarial network (GAN) and their outstanding contributions to silicon photonic development.

Variational Autoencoder (VAE)
To illustrate the VAE, we first introduce the autoencoder, which is applied for high-fidelity data compression and reduction. [166] It comprises two parts, an encoder and a decoder: the encoder compresses the high-dimensional data into a low-dimensional latent vector containing the main features of the dataset, and the decoder maps the latent vector back to the original data with maximal fidelity. [167,168] However, the autoencoder does not possess generative capacity, since the latent space is not regularized; only a small fraction of the latent space corresponds to the principal features of the high-dimensional dataset, and the rest of the space has no specific function, so random sampling in the latent space is likely to produce garbage output that does not match the dataset features. VAEs are advanced variants of autoencoders that overcome this limitation by regularizing the latent space. [169,170]

Figure 23. a) The PINN framework. Reproduced with permission. [160] Copyright 2021, Society for Industrial and Applied Mathematics; b) Methodology for lens optimization via a physics-driven neural network: the initial lens design is fed into MaxwellNet, yielding the EM response, and the objective function is derived via Green's function convolution and optimized through backpropagation until it converges. Reproduced with permission. [161] Copyright 2023, AIP Publishing; c) Utilization of the WaveY-Net-based optimizer for ultra-fast, high-accuracy forward prediction and inverse design.
Figure 24a illustrates the architecture of the VAE, where the encoder outputs not a latent vector directly but a mean value μ and standard deviation σ that parameterize a distribution over the latent space. The reparameterization trick is applied to generate latent vectors by sampling from this probability density function, which can be represented as

z = μ + σ · ε

where the variable ε follows the standard Gaussian distribution. The decoder thus acquires generative capacity, mapping latent vectors back to the original space to generate new data. To regularize the latent space, a constraint is imposed on this latent distribution during training, forcing it toward a normal distribution. Therefore, the training loss of the VAE contains both a reconstruction term and a regularization term: the reconstruction loss, as in the autoencoder, is the mean squared error or binary cross-entropy between the original input and the reconstructed output, while the regularization loss is the Kullback-Leibler divergence between the latent distribution and the Gaussian prior. [170]
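The training objective just described can be summarized in a short PyTorch sketch, with latent vectors drawn via the reparameterization trick and the loss combining reconstruction and KL terms; the dimensions are illustrative, not those of any cited VAE.

import torch
import torch.nn as nn

n_in, n_latent = 400, 16
enc = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU())
to_mu, to_logvar = nn.Linear(128, n_latent), nn.Linear(128, n_latent)
dec = nn.Sequential(nn.Linear(n_latent, 128), nn.ReLU(), nn.Linear(128, n_in))

def vae_loss(x):
    h = enc(x)
    mu, logvar = to_mu(h), to_logvar(h)
    eps = torch.randn_like(mu)                   # epsilon ~ N(0, I)
    z = mu + torch.exp(0.5 * logvar) * eps       # z = mu + sigma * epsilon
    recon = nn.functional.mse_loss(dec(z), x)    # reconstruction term
    # KL(N(mu, sigma^2) || N(0, I)), closed form for diagonal Gaussians.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

loss = vae_loss(torch.rand(32, n_in))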
Raw VAE models can only generate layouts similar to the training-set patterns and cannot design a device to meet a target response, preventing their direct application to photonic research. A conditional variational autoencoder (CVAE) was introduced to address this issue. [174-176] Among these works, the most detailed illustration is given in Ref. [172]. This paper refined the method several times, resulting in SOI-based power splitters with an ultra-compact footprint of 2.25 μm × 2.25 μm and ≈90% transmission efficiency over 1250-1800 nm. The optimization region was a square with a side length of 2.25 μm containing a 20 × 20 array of holes. To enlarge the optimization space, the hole diameters were not uniform but could vary from 42 to 77 nm, corresponding to neural network output values from 0.3 to 1; output values below 0.3 indicate no hole. Figure 25 shows the strategies adopted for the inverse design. Figure 25a differs slightly from the general VAE: the device structure and optical features are encoded separately along their respective paths, together composing the latent space. Once the network is well trained, the concatenation of the target optical response and data sampled from the Gaussian space is set as the latent vector and fed into the decoder to obtain design parameters. However, adding the response coding term destroys the Gaussian distribution of the latent space, resulting in partial clustering of the latent vectors in the sample space and a non-negligible impact on the performance of the generated devices. To address this, an adversarial block was added to the model to isolate the response coding from the latent space, [177] as demonstrated in Figure 25b. Besides, two by-paths were added: one maps the reverse-encoded response to the input, forming a dual input channel with the original device parameters; the other connects the latent space to the adversarial block to eliminate the response-encoding component in the latent space. The model performance was notably improved in this way. Furthermore, they employed an active learning strategy, whose flowchart is shown in Figure 25c. In this approach, new device patterns with different split ratios are generated by the initially trained A-CVAE model, and labels are added to these patterns based on the optical responses from FDTD simulations; these labeled data are then appended to the dataset for the next round of training, significantly improving model performance (see Figure 25d). As a novel design method, several parallel comparisons were made with similar cases using other optimization schemes, including PSO, [15] the ADJ method, [57,70] and the fast search method. [30] The results showed that this new method yields devices with significant advantages in bandwidth, footprint, and insertion loss. Another example of a VAE combined with the ADJ algorithm for the inverse design of a nanophotonic power splitter was reported in Ref. [178]. Besides power splitters, [173-176,178] the VAE has more recently achieved notable success in other silicon photonic devices, such as digital multimode interference waveguides [179] and the Starshot lightsail, [180] see Figure 26a-d.

Generative Adversarial Network (GAN)
Another representative generative model, the GAN, is a prominent framework that can realize even better results than the VAE. A GAN mainly comprises two networks, a generator and a discriminator, as illustrated in Figure 24b. The generator creates new data by mimicking the features of the training set, while the discriminator tries to distinguish whether the data are real (from the training set) or fake (from the generator). [184] The Jensen-Shannon (JS) divergence [185] is used to evaluate the difference between the distributions of real and fake data. The generator's loss is defined as the JS divergence between the two distributions; minimizing it during training makes the generated data look more realistic and fools the discriminator, while the discriminator seeks to maximize the JS divergence to improve its discriminative capacity. The generator and discriminator are trained alternately in a game-theoretic manner [186] to play this minimax game, and their faking and discriminating abilities are gradually enhanced during adversarial training. [187] Eventually, the generator produces data whose features the discriminator can no longer distinguish, reaching the Nash equilibrium. [188]

Figure 23 (continued). c) Upper left: the computational graph for local adjoint optimization using WaveY-Net to maximize the +1 order diffraction efficiency. Upper right: adjoint gradients of a randomly sampled device layout calculated with FDFD, WaveY-Net, and a data-only UNet, respectively, where WaveY-Net achieves almost the same result as FDFD. Lower left (two graphs): optimization pathways and outcomes for local adjoint optimizations performed with the FDFD solver and WaveY-Net. Lower right (two graphs): performance statistics histograms of 100 optimizations using the FDFD solver and WaveY-Net. Reproduced with permission. [162] Copyright 2022, American Chemical Society; d) PINN homogenization of a Vogel spiral array in the radiative (scattering) regime. Left: the Vogel spiral pattern. Left center: the electromagnetic field distribution simulated with the finite element method (FEM). Right center: the permittivity distribution derived by the PINN from the FEM field pattern. Right: FEM verification of the electric field distribution of the derived pattern. Reproduced with permission. [163] Copyright 2020, The Optical Society.

With the ability to explore the underlying laws of complex data, GANs have been applied to discover and design silicon photonic devices. In particular, Jiang and Wen et al. proposed a design strategy that incorporates a GAN with topology optimization methods and achieved a sequence of successes in metagrating inverse design, [75,181-183] as depicted in Figure 26e-g. They presented a conditional generative neural network that produces an ensemble of topology-optimized patterns with high performance over a wide range of deflection angles and operating wavelengths. [183]

Figure 25. Conditional variational autoencoder (CVAE) model for designing nanopatterned power splitters. [172] a) Common CVAE model; b) Adversarial CVAE (A-CVAE); c) A-CVAE-based active learning strategy; d) FOM comparison of different CVAE models. Reproduced with permission. [172]

Figure 26. a,b) Power splitters. Reproduced with permission. [173-176,178] Copyright 2021, SPIE; Copyright 2021, OSA Advanced Photonics Congress; c) Multimode interference. Reproduced with permission. [179] Copyright 2021, American Chemical Society; d) Starshot lightsail. Reproduced with permission. [180] Copyright 2022, American Chemical Society; Copyright 2020, American Chemical Society; Copyright 2019, American Chemical Society; e-g) Metasurfaces. Reproduced with permission. [75,181-183] Copyright 2019, American Chemical Society.
The architecture of the global topology-optimizing network (GLOnet) is shown in Figure 27a. Its input comprises the operating wavelength, deflection angle, and random noise, where the stochastic noise provides diversity in the generated devices. With only a small number of existing high-performance patterns, the generator creates batches of new design layouts by mimicking the main features of these devices. The model still works for wavelengths and angles outside the training set and outperforms the alternative of geometrically stretching the dataset's devices. Among the generated layouts, the better ones were selected and added to the dataset to train a second-generation GAN, boosting the model capacity. Furthermore, iterative topology optimization was executed on the excellent patterns generated by the second-generation model, leading to further efficiency enhancement for individual devices (Figure 27b) and for the overall population across a broad parameter space (Figure 27c). This process also considers robustness to manufacturing defects and imposes physical constraints on the design. The global topology optimization approach focuses on learning and generalizing the essential characteristics of high-performance devices, efficiently generating metasurfaces with the target response rather than blindly running EM simulations on arbitrary structures, most of which are far from the optimal solution and thus a vast source of wasted computation. Benchmarking showed that the GAN-generate-and-refine method was five times faster than iterative-only optimization for designing devices with equivalent performance.
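For reference, the sketch below shows one generic adversarial update, with the discriminator trained to separate real from generated patterns and the generator trained to fool it; it uses the standard cross-entropy formulation of the GAN game and is not the conditional GLOnet of Ref. [183].

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                  nn.Linear(256, 400), nn.Tanh())        # noise -> pattern
D = nn.Sequential(nn.Linear(400, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))                     # pattern -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 400)               # placeholder training patterns
fake = G(torch.randn(32, 64))

# Discriminator step: label real patterns 1 and generated patterns 0.
d_loss = (bce(D(real), torch.ones(32, 1))
          + bce(D(fake.detach()), torch.zeros(32, 1)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool the discriminator into labeling fakes as real.
g_loss = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()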

Knowledge Discovery
Another important area in DL-assisted research is knowledge discovery: providing intuition about the underlying physics through dimensionality reduction and manifold learning. Knowledge discovery extracts the most valuable information from extensive datasets through deep analysis of complex data structures, precise prediction of main features, and profound insight into physical phenomena. In other words, knowledge discovery helps identify, among a wide array of design parameters, the key factors that substantially influence photonic behavior. Within this context, dimension reduction methods such as manifold learning are indispensable components of knowledge discovery. Manifold learning is a specialized machine learning technique that maps high-dimensional data into a lower-dimensional space, enabling a more coherent understanding and visualization of the data's intrinsic structure and interrelationships. It provides a framework that enhances knowledge discovery by facilitating an intuitive interpretation of complex patterns within datasets.
Typically, such studies ingeniously employ dimension reduction techniques to significantly reduce both the design and response spaces of nanophotonic systems. [189,191] This methodology serves a dual purpose: it remarkably mitigates computational complexity and concurrently establishes a robust analytical framework that offers a meticulous understanding of the complex interaction between light and matter. Specifically, the study in Ref. [189] utilizes an autoencoder to preserve the natural nonlinear features of the response space, as shown in the upper part of Figure 28a. Through rigorous optimization, the autoencoder achieves a minimal MSE of less than 0.05 for the response space while retaining essential physical insights. As Figure 28b shows, the reconstruction error reveals that the autoencoder surpasses traditional methods such as principal component analysis (PCA) and kernel PCA (KPCA), particularly in maintaining the nonlinear properties of the response space. Moreover, a pseudo-encoder architecture was adopted to reduce the dimension of the design space and subsequently correlate it with the reduced response space, as shown in the lower part of Figure 28a. By extracting the weights from the first layer of the pseudo-encoder (see Figure 28c), the sensitivity of the design response to various design parameters can be intuitively analyzed (see Figure 28d). This method unveils the fundamental physical principles linking light and material within nanostructures. More recently, a novel approach based on manifold learning was proposed for knowledge discovery and nanostructure inverse design with minimum geometric complexity, [190] allowing a deeper understanding of the physics and interactions within complex nanostructure systems, as shown in Figure 29a. The integration of manifold learning and knowledge discovery has significantly advanced the field, offering a more efficient and insightful framework for the design and analysis of photonic systems. Moreover, other achievements in knowledge discovery for nanophotonic research also deserve attention, such as finding the range of responses of nanophotonic structures and the feasibility of a desired response (Figure 29b), [191] characterizing multi-parameter design spaces (Figure 29c), [192] and understanding the impact of design parameters on the electromagnetic response (Figure 29d). [193]

Figure 27 (partial caption). Generated devices before and after (right panel) topology refinement for different wavelengths and deflection angles. Reproduced with permission. [183] Copyright 2019, American Chemical Society.

Figure 28. Methodology for uncovering the fundamental physics of light-matter interaction in nanostructures through dimensionality reduction. [189] a) Upper part: an autoencoder used for dimensionality reduction of the response space; lower part: a pseudo-encoder used for reducing the design space, which subsequently maps to the response space; b) The MSE after reconstruction using different dimensionality reduction techniques; c) The architecture of the pseudo-encoder; d) The connection weights between the pseudo-encoder input layer and the first hidden layer, revealing the influence of various design parameters on the response characteristics. Reproduced with permission. [189] Copyright 2019, John Wiley and Sons.

Analysis and Discussion
We have witnessed exciting progress in this rising research field: many devices with more compact structures and superior performance have been demonstrated over the last few years, making intelligent algorithms promising candidates for precisely manipulating light at the subwavelength scale. Still, the methods have "one last mile" to go before they can deliver a suitable roadmap for designing large-scale devices or systems with high complexity. In this section, the challenges and prospects of inversely designing silicon photonic devices are discussed from several perspectives.

Computational Resource Requisition
So far, most reported studies have restricted their design areas to less than 100 μm². This is modest, given that many important photonic building blocks, such as arrayed waveguide gratings (AWGs), star couplers (SCs), multimode interference (MMI) structures, and meta-structured gratings, usually occupy tens of thousands of square micrometers. The main reason inverse methods have not yet been extended to large-size device design is the explosive growth of the required computational resources.
It is undeniable that inverse design methods are computationally intensive. Numerous EM simulations must be executed during one design process, whether optimization methods or deep learning algorithms are adopted. Meanwhile, the EM simulation itself can be resource-consuming if the simulated area, or the number of meshes required, is considerably large, and designing devices becomes tougher as size and complexity increase. In that sense, computational resource consumption is one of the most critical criteria for evaluating the performance of an inverse method.
In general, the total time consumption of an optimization-based inverse design procedure can be expressed as the product

T_total = N_iterations × N_rounds × t_simulation    (20)

where N_iterations represents the number of simulations required in one round of iteration, which takes different values for different algorithms (for DBS, it is the number of pixels; for GA and PSO, it equals the population size; and for the adjoint method, it is 2, because the states of all pixels can be updated via one forward simulation and one adjoint simulation); N_rounds indicates the total number of rounds required to find the solution (for GA and PSO, this is the number of generations); and t_simulation represents the time taken for a single EM simulation.
The corresponding expression for the deep learning-based method can be written as

T_total = N_dataset × t_simulation + T_train    (21)

where N_dataset represents the number of data samples required to train the network and T_train indicates the time taken to train the network.
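The two cost models can be made concrete with a small calculation; the numbers below are illustrative placeholders rather than measurements from Table 1.

def optimization_cost(n_iterations, n_rounds, t_simulation_h):
    """Eq. (20): simulations per round x rounds x time per EM simulation."""
    return n_iterations * n_rounds * t_simulation_h

def deep_learning_cost(n_dataset, t_simulation_h, t_train_h):
    """Eq. (21): dataset generation cost plus one-off network training."""
    return n_dataset * t_simulation_h + t_train_h

# Adjoint method: 2 simulations per round (one forward, one adjoint).
print(optimization_cost(n_iterations=2, n_rounds=150, t_simulation_h=0.02))
# Data-driven model: the dataset generation dominates the budget.
print(deep_learning_cost(n_dataset=10_000, t_simulation_h=0.02, t_train_h=24))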
Table 1 summarizes the time consumption of diverse optimization and deep learning methods used for designing silicon photonic devices. The time consumed per unit area and the total number of simulations were derived for each demonstration so that a direct comparison can be made. According to the table, the DBS method is clearly one of the most computationally expensive algorithms, as it requires updating the optical response for every pixel flip and may take hundreds of EM simulations to complete just one round of iteration. Heuristic methods such as GA and PSO were not developed to address this issue, so they show no significant improvement in computational efficiency. Hybrid algorithms that combine adjoint methods with topology optimization strategies may require far fewer EM simulations per round while converging quickly (requiring fewer rounds before the loss function stabilizes), benefiting from their gradient-based nature; as a result, much less computational resource is required when these methods are adopted.

Table 1. Time consumption of inverse design demonstrations for silicon photonic devices.

Method [Ref.] | Design area [μm × μm] | Total cost | Cost per unit area
DBS [195] | 1.2 × 5 | 96 h | 16 h μm⁻²
DBS [196] | 3 × 3 | 300 h | 33.3 h μm⁻²
GA [35] | 2.16 × 2.16 | 48 h | 10.3 h μm⁻²
PSO [51] | 4.8 × 4.8 | 42 h | 1.8 h μm⁻²
ADJ [57] | 2.4 × 3 | 7 h | 1.0 h μm⁻²
ADJ & LST [86] | 3.8 × 2.5 | 2 h | 0.2 h μm⁻²
DTO [197] | 1.4 × 1.4 | 12 h | 6.1 h μm⁻²
DTO & LST & ADJ [100] | 2.8 × 2.8 | 36 h | 4.6 h μm⁻²
FCL [152] | — | 61 h for 9190 samples | —
FCL [154] | 4.25 × 15.38 | 709 h for 8510 samples | 130.2 samples μm⁻² or 10.8 h μm⁻²
CVAE [172] | 2.25 × 2.25 | 16 000 samples | 3160.4 samples μm⁻²
VAE [179] | 33 × 6 | >10 000 samples | >50.5 samples μm⁻²
GAN [183] | — | 10 000 samples after multiple topology optimizations | —
PGGANs [181] | — | 178 h for 9350 samples | —

Figure 29. a) Left: the distribution of reflective properties within the latent response space; right: the convex latent space mapped from the feasible response. Reproduced with permission. [190] Copyright 2022, The Authors; b) The representation of the convex hulls in reduced 2D (left) and 3D (middle) response spaces; right: feasible non-convex response spaces identified through the SVM algorithm. Reproduced with permission. [194] Copyright 2019, Wiley; c) Strategies for characterizing a multi-parameter space. Left: a sparse collection of good designs generated by optimizing the original high-dimensional design space; middle: dimensionality reduction employed to map good designs onto a low-dimensional subspace; right: metric characterization in the low-dimensional space. Reproduced with permission. [192] Copyright 2019, Springer Nature; d) By learning the physical relationships between structural parameters and response characteristics, new designs with novel target attributes are constructed. Reproduced with permission. [193] Copyright 2020, American Chemical Society.
For deep learning-based inverse design methods, the greatest resource consumption lies in dataset generation, as sufficient EM simulation data are required for the networks to converge to an acceptable level. Taking a supervised learning model as an example, [198] at least 5000 samples per label class must be provided to satisfy the basic requirement. The performance of other deep learning-based studies is also summarized in Table 1. A point worth mentioning is deep learning's unique advantage of genericity: once the network is trained, the design parameters for a new desired response can be computed in a very short time, without re-running an optimization. This uniqueness is usually considered a key argument for the excellence of deep learning-based inverse methods. Yet the claim may be overestimated at a practical level. As generic models are asked to predict more general situations, larger parameter and response spaces must be covered, and the network structure has to grow larger and more complex to adapt. Eventually, the scale of the dataset required to train a more general model can explode to an astonishing level, making the computational investment huge. It is then hard to tell whether training such a complex network is still a smart choice.
Finally, the time spent per EM simulation (t_simulation) in the expressions may be the most important term. For most state-of-the-art rigorous vector EM simulation techniques, such as finite-difference and finite-element methods in both the time and frequency domains, the required computational resources usually scale quadratically with the simulation area (and cubically with the volume if 3D EM simulations are applied). In that case, this term becomes the fastest-growing term in Equations (20) and (21).
In general, a nearly quadratic increase of computing resources with the design area is expected, which greatly hinders these methods from being applied to large-size device design. Advances in inverse algorithms and EM simulation techniques will be the intrinsic driving force of this research field.

How to Achieve Optimal Performance Under the Current Process State
A smaller pixel size means more design DOF and more precise manipulation of light, which is expected to yield better simulation results. However, it also implies finer feature sizes and greater fabrication difficulty. Lithography exposure, etching, and other uncertainties can cause differences between the designed layout and the manufactured wafer, significantly degrading device performance, most obviously for small features. The current silicon photonic platform mainly adopts 193 nm photolithography, corresponding to a minimum processing feature size on the order of 100-180 nm. Consequently, the feature sizes of most commercially available silicon photonic devices are concentrated around 100 nm, as shown in Table 2.
Such coarse pixels make the design of multifunctional devices very challenging. Moreover, due to the non-ideal etching profile, the devices usually exhibit strong scattering, accompanied by insertion loss and crosstalk; all these factors markedly affect functional performance. Analog devices with ultra-fine pixel sizes could be shaped into almost any complex geometry and are expected to achieve high efficiency and multifunctionality. However, as mentioned earlier, such extremely small pixels defy current fabrication processes: even if excellent performance is achieved in simulation, the actual device may fall short. Several measures can alleviate the performance decline of analog devices caused by fabrication uncertainty. One solution is to adopt the LST method to constrain the minimum feature size and to detect and eliminate small gaps and bridges. [86] Another is the density filter strategy, common in density topology optimization, [84,199] where various threshold factors are set to predict different degrees of erosion/dilation, reducing the difference between the simulated pattern and the fabricated device. Alternatively, Gaussian filters have been added to generative models to eliminate fine features in the generated devices. [75] A less well-known but more fundamental solution is to include the fabrication process in the design flow. [200,201] Mentor Graphics offers a computational lithography simulation tool that predicts the post-lithography pattern, [202] making simulation results more realistic.

Multi-Attribute Comprehensive Comparison
Photonic inverse design aims to find the optimal device parameters, but before that, it is even more pivotal to determine which design method best suits the specific scenario, to get twice the result with half the effort. Figure 30 compares the properties of different inverse design strategies from several perspectives for reference. The DBS algorithm is easy to implement and needs no complex framework or optimization strategy; however, it updates only one pixel at a time, which puts devices with large design DOF or size beyond its reach, and it struggles to produce excellent designs since it lacks global search capability. A heuristic algorithm can avoid getting trapped in local optima and gains powerful global search capability to find better designs, but the drawback is also significant: it may require a large population and many generations to find satisfying designs, particularly when the objective function or design space becomes complex. Faced with such high-dimensional problems, gradient algorithms are suitable candidates. The adjoint method provides accurate gradient information for each point in the design region through only two EM simulations and updates all design parameters under the guidance of the gradient, an extremely efficient optimization technique (recall that the DBS algorithm, which updates only one parameter per simulation, is rather clumsy and inefficient; heuristic algorithms are no better, as they must evaluate a large number of candidates for comparison and selection, of which only a few simulations are helpful while the rest are wasted). This powerful tool brings faster convergence and more refined designs, especially for large devices, and it readily cooperates with the LST and DTO algorithms to control feature sizes to match state-of-the-art production processes. From the perspective of economy and performance, the gradient algorithm is the optimal candidate and meets large-device development requirements in the long view. However, it is not flawless: the gradient strategy may converge to a poor local optimum, depending on the initialization of the design pattern. Deep learning has the prominent strength that, once the model is trained, it can be reused without re-running a new optimization; additionally, its second-scale inverse design speed is unmatched by other optimization methods, its global search capacity is recognized, and it can produce designs that outperform those in the dataset. The main drawback of deep learning is its huge one-time cost, and we believe it is wiser to use conventional simulation and optimization methods if they can solve the problem with equal or lower computational expenditure. Another policy with less reliance on datasets, reinforcement learning, offers a compelling approach to photonic research. It learns through interactions with the environment, striking a balance between exploring new designs and exploiting known optimized ones, dynamically making decisions throughout the design process and progressively optimizing the design parameters. To date, reinforcement learning has showcased impressive design proficiency in photonics, including optimizing color generation from dielectric nanostructures, [209] inversely designing wavelength demultiplexers and polarization devices, [210] refining the structure of freeform metagrating deflectors, [211] and even powering an optimization package rooted in the policy-gradient strategy of reinforcement learning. [212] Such advancements undoubtedly merit attention. Overall, these methods are both competitive and complementary, and developing more advanced hybrid algorithms that draw on each other's strengths will be essential.

The Importance of Initial Guess for Optimization Methods
Given the intricate mapping between the design space and the response properties, the design space exhibits non-convex characteristics; consequently, varying initial conditions can affect the final result. For the conventional forward design method driven by intuition, the initial guess is predominantly determined by the designer's experience, and the subsequent manual optimization is anchored to that initial condition; this approach exhibits low robustness, and the final design is sensitive to the initial guess. Gradient-based optimization techniques like ADJ, LST, and DTO rely heavily on gradient information, so varying initial guesses can lead the algorithms to converge to different local optima, and even minor variations in the initial design can result in drastically different final designs; these methods demonstrate heightened sensitivity to the choice of initial guess and reduced robustness. Although the DBS method is not gradient-based, its behavior is similar, with the final design heavily dependent on the initial guess. Heuristic algorithms, such as GA and PSO, exhibit reduced dependency on initial selections: they use a population to explore the design space comprehensively, enhancing search diversity and lessening reliance on individual initializations, hence exhibiting strong robustness.
Considering specific requirements, if we do not pursue the global optimum and the design space contains multiple local optima satisfying the requirements, the selection of initial conditions becomes less critical. If the global optimal solution is pursued, multiple initialization emerges as a potent strategy: searching from multiple different starting points enhances the algorithm's robustness and significantly reduces the effect of a single initialization's uncertainty on the final design. For a designer with a wealth of experience, taking the most promising design as the initial guess is also a wise choice.
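A minimal sketch of the multiple-initialization strategy is given below, with a toy random local search standing in for the EM-driven optimizer and a synthetic multimodal figure of merit standing in for a simulated objective; all names and numbers are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def local_optimize(design, figure_of_merit, steps=200, step_size=0.1):
    """Gradient-free stand-in for a local optimizer (random local search)."""
    best, best_fom = design, figure_of_merit(design)
    for _ in range(steps):
        trial = np.clip(best + step_size * rng.standard_normal(best.shape), 0, 1)
        if (fom := figure_of_merit(trial)) > best_fom:
            best, best_fom = trial, fom
    return best, best_fom

def multi_start(figure_of_merit, n_params=20, n_starts=8):
    # Run the same local optimizer from several random starting designs.
    runs = [local_optimize(rng.random(n_params), figure_of_merit)
            for _ in range(n_starts)]
    return max(runs, key=lambda r: r[1])   # keep the best across all starts

# Toy multimodal figure of merit with many local optima.
fom = lambda d: float(np.sin(5 * d).sum() - ((d - 0.5) ** 2).sum())
best_design, best_value = multi_start(fom)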

From Discrete Devices to Continuous Devices
Currently, mainstream photonic inverse design strategies tend to pixelate the design space, treating each pixel as a binary unit. In such a setup, each pixel's material attributes (e.g., refractive index, absorption) can take only two "states" during optimization and fabrication. The refractive index contrast between the two states is usually set high, making the perturbation strong enough to effectively mold the light flowing through the area formed by the binary pixel array. This is an excellent strategy for expanding the design space to much higher dimensions while keeping design and fabrication complexity within a reasonable range: strong perturbation provides sharp tuning from every pixel, ensuring the parameter space is large enough to cover the desired points in the response space. Yet strong perturbation and binary encoding also mean a discrete design space and discrete solutions in the response space. The optimal points are likely to fall between discrete response points, and they are sometimes missed entirely if the high-dimensional response surface in that area is sharp enough.
The discretization of the parameter space can be classified into two categories, namely geometric discretization and value discretization. For geometric discretization, the size, shape, and spatial arrangement of the pixels determine how the discrete points are distributed in the optical response space. As a result, to avoid missing solutions and further improve design performance, it is necessary to develop methods that systematically tune these hyper-parameters.
On the one hand, etching every pixel or aperture of a digital device with the same size severely limits the accessible design space; pixels can instead be designed and fabricated with arbitrary sizes or diameters as long as the feature size meets process requirements. Letting the etched geometry take continuous values enables the device to produce richer optical features, which are more likely to fit the target response. Some works developing devices with continuous geometries are listed in Table 2; [15,50,70,205] they are typically designed with a dozen to tens of DOF, far fewer than other digital devices, meaning they reach comparable performance with fewer optimized parameters. That is, at equal DOF, a continuous design space leads to better device designs overall. [175] On the other hand, value discretization of material attributes is often realized via irreversible processes such as etching and doping: the optical response is fixed once the device is fabricated, resulting in a customized device with greatly limited versatility. Meanwhile, since the coverage of the design space may be excessive under the strong perturbation of value discretization, there can be tens or even hundreds of local optima, each able to satisfy the required device performance; in that case, it seems unnecessary and redundant to introduce strongly perturbed discrete states to cover such a large design space. The last and most crucial issue hindering the combination of inverse design methods and silicon photonics is the lack of reconfigurability of inverse-designed devices, whose responses, as noted above, are frozen at fabrication. Fortunately, pioneers are aware of this problem and are pinning their hopes on dynamically tunable silicon photonic devices. Cheng et al. implemented a photonic emulator with dynamic adaptive modulation on the SOI platform, as presented in Figure 31a. They experimentally demonstrated multiple typical applications of the photonic emulator, including an optical multiple-input-multiple-output (MIMO) descrambler, optical vector-matrix multiplication, and a tunable wavelength descrambler. Here, the manipulation of light does not rely on etching complex patterns into the surface but on changing the local effective refractive index by thermal modulation with heater electrodes. The optical response is obtained through the physical process of light propagation, which cleverly avoids burdensome EM simulations. In conjunction with a gradient descent algorithm, this photonic emulator enables real-time parameter search and quickly iterates toward design requirements. Moreover, the emulator can be reconfigured to achieve different functions, something fixed silicon photonic devices cannot match. Their work breaks through the conventional photonic design and simulation mindset and lays the groundwork for the future development of programmable silicon photonic devices. For larger-scale programmable optoelectronic integrated circuits, interested readers can follow the work of Bogaerts et al. [213]
Figure 31b presents the technology stack for programmable photonic circuits in their work, including the latest photonic and circuit system architectures as well as cooperative software and hardware control strategies. There, the parameter configuration of the drive electrodes is achieved by forward design. However, this approach seems less than ideal in the long term, since such coarse modulation cannot precisely manipulate light and will sooner or later hit bottlenecks as programmable devices develop toward large-scale integration. Configuring the devices with an inverse design algorithm is more feasible; one can refer to the photonic emulator configuration scheme described above.

Limitations of Generative Model for Photonic Inverse Design
Although previous reviews have mentioned that generative models may alleviate the burden of dataset collection to some extent, [215] it might be too early to draw that conclusion. Utilizing generative models for photonics development is a frontier technique, and a few concerns should be considered.
First, how to obtain high-performance devices to train generative models remains a crucial question that has not been adequately explored. The performance of generative models mainly relies on the quality and quantity of the training data. However, high-performance data cannot be generated directly; instead, significant computational resources are required to produce vast initial datasets, from which excellent samples are selected to train the generative models. This practice is inefficient, yet necessary. While the dataset ostensibly used to train the generative model may not seem large, the overall consumption remains substantial, which complicates the development process and adversely impacts versatility. Additionally, if the training data are intentionally biased toward "good" designs, the model might overfit to these specific designs, i.e., perform well on similar data but struggle to generalize to new, unseen designs. Furthermore, selecting a "good" design is inherently challenging, since it is hard to say what counts as "good"; a clear standard is necessary to evaluate what constitutes a "good" design.

Figure 31. a) Photonic emulator with dynamic adaptive modulation on the SOI platform. Reproduced with permission. [214] Copyright 2023, American Chemical Society; b) Upper left: diagram of the core component of the programmable photonic circuit; lower left: schematic diagram of a photonic chip integrated with analog and digital control electronics; middle right: optimization algorithm and user application programming interface (API). Reproduced with permission. [213] Copyright 2020, Springer Nature.

The Importance of Selecting Loss or Metric in Training and Optimization
In inverse engineering, both the loss and the metric evaluate the model's performance and represent the indicators of primary concern. They are inherently similar, with only slight variations in application. The loss quantifies the deviation between the model's predictions and the actual values, providing a specific training directive for networks and ensuring progressive convergence during training. The metric is a more intuitive indicator for evaluating device performance, such as transmittance, reflectance, or matching rate, and typically concerns the optimization process. In most cases, the loss and the metric coincide. The definition of the loss or metric function thus emerges as a pivotal research question for photonic inverse design.
Primarily, the loss must be differentiable to facilitate weight adjustments through gradient-based techniques. The gradient, the derivative of the loss with respect to the model parameters, provides crucial information about both the direction and the magnitude of parameter updates; without a differentiable loss function, these essential gradients become incalculable, impeding model optimization. The metric likewise needs to be differentiable for optimization algorithms, especially gradient-based ones.
When selecting a loss or metric, one should carefully consider its inherent characteristics and suitable scenarios. Mean squared error (MSE) is frequently employed in regression tasks but is sensitive to outliers. Cross-entropy loss is apt for classification, directly targeting probability estimates, yet may falter with imbalanced datasets. Hinge loss, which maximizes the margin between classes, is a staple of support vector machines but is primarily designed for binary classification. Conversely, Huber loss offers robustness against outliers, balancing the attributes of MSE and mean absolute error (MAE). Selecting an appropriate loss or metric ultimately hinges on the specific modeling objectives and data characteristics.
Although plentiful loss and metric functions are available, it does not follow that these existing functions can aptly characterize the performance of photonic devices. Imagine two microcavity resonators with similar modal structures but slightly different resonant wavelengths.
When the design objective is a high-Q microcavity with a specific resonant wavelength, employing MSE as a performance metric may be unsuitable, since even minor deviations in wavelength can produce large increases in the MSE value. The work in Ref. [216] may provide insight for photonic researchers. It proposed a metric that combines triplet loss and MSE, offering a more rational similarity measurement than the commonly used MSE and MAE. Figure 32 illustrates the metric learning method, where different resonance line shapes serve as different response classes, rather than using the resonance wavelength (λ1 or λ2 in the figure) as the class identifier. The triplet loss ensures that spectra from different classes are separated, realized via three Siamese networks with shared weights, while the MSE loss clusters spectra with similar responses (i.e., those from the same class) closely within the low-dimensional embedding space. Similarly, in Ref. [214], a customized cost function was employed to characterize device performance and assess the alignment between the current function and the design targets. These methods provide an interpretable visual representation of intricate photonic responses, and their compatibility with current deep-learning technology will further promote research in intelligent photonics.

Figure 32. Metric learning for photonic responses, where the shape of the resonance lines is used to differentiate response categories rather than the resonance wavelength. [216] Reproduced with permission. [216] Copyright 2023, American Chemical Society.
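A minimal sketch of the combined-metric idea follows: a triplet term separates embeddings of different response classes while an MSE term clusters same-class spectra. This is a generic reading of the strategy in Ref. [216], not its actual implementation, and the embedding network is a placeholder.

import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(200, 64), nn.ReLU(), nn.Linear(64, 8))
triplet = nn.TripletMarginLoss(margin=1.0)

def metric_loss(anchor, positive, negative):
    # The same (Siamese) embedding network processes all three spectra.
    za, zp, zn = embed(anchor), embed(positive), embed(negative)
    # Push the anchor away from the other class, pull it toward its own.
    separation = triplet(za, zp, zn)
    clustering = nn.functional.mse_loss(za, zp)   # same-class spectra
    return separation + clustering

loss = metric_loss(torch.rand(16, 200), torch.rand(16, 200), torch.rand(16, 200))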
Moreover, for both deep learning and optimization techniques, it is better to incorporate physical constraints into the metric or loss so that the designs produced by the model follow physical laws, eliminating designs that are hard to fabricate or exhibit unacceptable performance. As previously mentioned, in Ref. [86] a curvature penalty term was incorporated into the metric function, and the experimental assessment showed that the device was robust to manufacturing error. If the photonic system obeys a specific energy conservation law, integrating energy constraints into the design process can simplify development and prevent unnecessary detours. Besides physical constraints, regularization is another useful constraint for neural network training: incorporating regularization terms into the loss function helps prevent the model from becoming overly complex and overfitting the training data, ensuring effective generalization to new data while keeping the model concise and robust and avoiding noise and misleading features. These constraints not only ensure the practical feasibility and optimal selection of photonic devices but also offer deeper insight and guidance, leading to more efficient and reliable designs.
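As a simple illustration of such constrained losses, the sketch below augments a fitting term with an energy-conservation penalty (the summed output powers of a passive device should not exceed the input) and an L2 regularizer; the toy model and weighting coefficients are illustrative assumptions, not a published formulation.

import torch

def constrained_loss(pred_T, target_T, model, w_phys=1.0, w_reg=1e-4):
    fit = torch.nn.functional.mse_loss(pred_T, target_T)
    # Penalize predictions whose total output power exceeds the input power.
    energy_violation = torch.relu(pred_T.sum(dim=-1) - 1.0).mean()
    # L2 regularization over all trainable weights.
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return fit + w_phys * energy_violation + w_reg * l2

model = torch.nn.Linear(400, 4)            # toy surrogate: pattern -> 4 ports
x, y = torch.rand(32, 400), torch.rand(32, 4) * 0.25
loss = constrained_loss(torch.sigmoid(model(x)), y, model)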

Non-Uniformity Distribution of the Dataset in the Response Space
In general regression problems, dataset imbalance is also quite important but rarely studied. [221,222] For deep learning-accelerated photonic device inverse design, it is a huge pitfall that the current state of the art in photonics research cannot offer an adequate solution to this thorny issue. Here, for the first time, we analyze its negative impact on the accuracy of forward prediction networks. Suppose a 1 × 4 power splitter with a design area represented as a QR-code-like structure, as shown in Figure 33a. A training set is collected for building a forward predictive network, where the design parameters are the binary code and the label is the splitting ratio of each port. The distribution over the design parameters is uniform, as shown in Figure 33b. However, the dataset distribution is not uniform in the response space: the dataset is concentrated around split ratios that are straightforward to realize, while it is sparse around uncommon ratios (Figure 33c), and some ratios are not attainable at all, as no data fall there. This inhomogeneity leads to scenarios in which the network is accurate only in data-dense regions and untrustworthy elsewhere. More seriously, feedback from the test set offers no help on this issue. The neural network shows excellent accuracy on the test set, a blind optimism generated by illusion, while the network remains inaccurate: since the test set was drawn from the dataset, it has the same distribution characteristics, so in the response space it is also sparse wherever the training set is sparse. As a result, testing is always done around dense regions, sparse areas are rarely probed, and favorable test results do not equal real network performance. A vast supplement of data cannot significantly alleviate this issue either, since the additional samples still concentrate in the dense areas while the sparse regions remain as before, having only a half-hearted effect. Although only the beam splitter case is cited, this unfavorable situation can also occur in other photonic devices. The preparation of uniformly distributed datasets is an urgent problem to be solved in the future.
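One pragmatic mitigation, sketched below, is to reweight training samples inversely to their density in the response space so that sparse split-ratio regions contribute as much to the loss as dense ones; the binning scheme and data are illustrative, and this eases rather than solves the problem described above.

import numpy as np

def inverse_density_weights(split_ratios, n_bins=20):
    # Bin the response labels and count the occupancy of each bin.
    counts, edges = np.histogram(split_ratios, bins=n_bins, range=(0.0, 1.0))
    bin_idx = np.clip(np.digitize(split_ratios, edges) - 1, 0, n_bins - 1)
    weights = 1.0 / np.maximum(counts[bin_idx], 1)   # rare bins weigh more
    return weights / weights.mean()                  # keep the overall scale

ratios = np.clip(np.random.default_rng(1).normal(0.5, 0.1, 5000), 0, 1)
w = inverse_density_weights(ratios)   # use as per-sample loss weights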

Other Obstacles to the Development of Silicon Photonics
Several other factors still plague the development of silicon photonics design methodologies. No algorithm can guarantee a globally optimal solution; we can only improve design methods to approach the optimum. The silicon photonics domain has not yet formed standardized design tools and process flows, which must be addressed to advance the scale and standardization of the silicon photonics industry. Moreover, current inverse design techniques target the device level, while multi-device on-chip integration is more significant from the application aspect; how to achieve development at the system level is a crucial issue to be studied.

Inverse Design of Other Silicon-Based Materials
This review has presented a series of silicon photonic devices based on the SOI platform and their advanced design methods. In addition to the SOI system, silicon nitride (SiN), with a high refractive index, high transmittance, excellent mechanical properties, and compatibility with the CMOS process, is widely adopted in optical waveguides and nanophotonic devices. Figure 34a presents an integrated nonlinear optical switch optimized using a gradient descent algorithm combined with the ADJ method and fabricated on a SiN platform. [223] Figure 34b shows a photonic spin selector optimized by the DBS algorithm, with a measured magnetic sensitivity of 700 pT/√Hz, used to resolve the spin direction of photons. [224] Additionally, metalenses [225] and grating couplers [226] have been optimized by inverse design algorithms on the SiN platform, as exhibited in Figure 34c,d. Silicon carbide (SiC) is also a promising photonic material for high-temperature, high-frequency, and high-power scenarios owing to its excellent electrical, optical, and thermal properties. Figure 34e-g respectively present a nanophotonic resonator, [227] a reflector, [228] and an optical cavity [237] developed on the SiC system, and Figure 34h shows a SiGe-based power amplifier developed with deep learning. [229] Other material platforms, such as the planar lightwave circuit (PLC), have been less studied for inverse design due to their low refractive index (1.47 for PLC at 1550 nm) and weak light modulation, resulting in large footprints and extra computing overhead.

Figure 34. a) SiN nonlinear optical switch. Reproduced with permission. [223] Copyright 2022, John Wiley and Sons; b) SiN photonic spin selector. Reproduced with permission. [224] Copyright 2021, Springer Nature; c) SiN metalenses. Reproduced with permission. [225] Copyright 2020, American Chemical Society; d) SiN grating. Reproduced with permission. [226] Copyright 2023, IEEE; e) SiC nanophotonic resonator. Reproduced with permission. [227] Copyright 2019, Springer Nature; f) SiC reflector. Reproduced with permission. [228] Copyright 2022, The Optical Society; g) SiC optical cavity. Reproduced with permission. [237] Copyright 2023, Springer Nature; h) SiGe power amplifier. Reproduced with permission. [229] Copyright 2022, IEEE.

Summary
Silicon photonics offers outstanding advantages of low power consumption, wide bandwidth, high speed, and high integration density, making it one of the disruptive technologies of recent decades. The inverse design of silicon photonic devices is a rapidly evolving research area with significant potential for enabling high-performance photonic devices with reduced design time and cost. Inverse design methods are powerful techniques that allow devices to be designed by specifying the desired output and optimizing the device parameters to achieve it. The choice of inverse design method depends on the complexity of the design problem, the design space, and the desired outcome. Heuristic methods are suitable for optimizing simple and well-defined structures. Gradient optimization methods are suitable for complex structures with a large design space, which they can explore efficiently. Deep learning methods are fast and can handle complex geometries. This article serves primarily silicon photonics development but is not limited to silicon-based platforms, and most of the concepts can be extended to photonic devices in other material systems. [232][233][234][235][236] Future research in this area should focus on developing more efficient and scalable optimization algorithms and machine learning models, integrating inverse design methods with the fabrication process, developing new data generation techniques, and improving our understanding of the underlying physics and materials science. Finally, we believe the current obstacles are only temporary, and more advanced methods will emerge to make silicon photonics design more efficient and intelligent.

Figure 2 .
Figure 2. A dial illustration of the silicon photonic inverse design methodology. The colored middle ring indicates the five major categories of inverse design methods. DBS, heuristic, and gradient algorithms are classical optimization algorithms, and discriminative and generative models are emerging deep-learning design approaches. The outer sector lists a series of specific algorithms or models. The inner sector depicts high-performance silicon photonic devices developed via inverse design and the corresponding optical field modulation plots or spectral response characteristic curves. Reproduced with permission. [195] Copyright 2017, The Optical Society. Reproduced with permission. [44] Copyright 2020, The Optical Society. Reproduced with permission. [86] Copyright 2017, Springer Nature. Reproduced with permission. [21] Copyright 2022, The Optical Society. Reproduced with permission. [179] Copyright 2021, American Chemical Society.

Figure 3 .
Figure 3. Polarization beam splitter designed by the DBS algorithm. [22] a) Device discretization into M×N cells, called "pixels", showing the material replacement process for a single pixel; b) DBS method optimization flowchart; c) Schematic of the device geometry; d) Comparison of simulated and experimental results. Reproduced with permission. [22] Copyright 2015, Springer Nature.
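As a reading aid for panel b), the following is a minimal sketch of the DBS loop under assumed names: `simulate` stands in for an FDTD/FEM call returning a figure of merit (FOM) to maximize, and `pattern` is the M×N binary pixel array of panel a).

```python
import numpy as np

def dbs_optimize(pattern, simulate, max_sweeps=5):
    """Direct binary search: toggle one pixel at a time, keep toggles that
    improve the FOM, and sweep the array until no toggle helps."""
    best = simulate(pattern)
    for _ in range(max_sweeps):
        improved = False
        for idx in np.ndindex(*pattern.shape):   # visit every pixel in turn
            pattern[idx] ^= 1                    # swap Si <-> SiO2 at this pixel
            fom = simulate(pattern)
            if fom > best:
                best, improved = fom, True       # keep the beneficial swap
            else:
                pattern[idx] ^= 1                # revert a harmful swap
        if not improved:
            break                                # no pixel helps: converged
    return pattern, best
```

Because each sweep costs M×N solver calls, DBS scales poorly with pixel count, consistent with its role as the brute-force baseline among the optimization algorithms surveyed here.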

Figure 12 .
Figure 12. LST algorithm combined with the ADJ method for optimizing a Y-splitter. [70] a) LST representation and optimization strategy for photonic devices. b) Geometry of the optimization region. c) Coupling efficiency evolution during the optimization. d) Coupling efficiency from 1500 to 1600 nm for the final optimized structure. Reproduced with permission. [70] Copyright 2013, The Optical Society.

Figure 15 .
Figure 15. On-chip wavelength demultiplexer (WDM) designed through a hybrid strategy based on the ADJ method, LST algorithm, and DTO approach. [100] a) The process used in the DTO method for designing silicon photonic devices. Left: grayscale initialization. Middle: continuous density optimization. Right: binary layout. b) Left: SEM images of the fabricated WDM. The middle and right panels show the S-parameters from the FDTD simulation and experimental measurement, respectively. Reproduced with permission. [100] Copyright 2015, Springer Nature.

Figure 18 .
Figure 18. Demonstration of silicon photonic research based on the FCN architecture. a) Fundamental framework of an FCN; b) Neuron node model; c) Network parameter update strategy based on gradient descent; d) Forward and inverse modeling of a beam splitter using FCNs. In the forward model, the design parameters are regarded as the input and the response properties are treated as labels, and vice versa in the inverse model. [138] Reproduced with permission. [138] Copyright 2019, Springer Nature.
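For readers unfamiliar with the setup sketched in panel d), below is a minimal forward-model sketch in PyTorch; the layer widths and the design/response dimensions are illustrative assumptions, not values from ref. [138].

```python
import torch

n_pixels, n_points = 400, 64            # assumed design and spectrum sizes
forward_net = torch.nn.Sequential(      # the FCN of panel a)
    torch.nn.Linear(n_pixels, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, n_points),     # predicted response spectrum
)
opt = torch.optim.Adam(forward_net.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

def train_step(designs, spectra):
    """One gradient-descent update (panel c): designs is a (batch, n_pixels)
    tensor of binary patterns, spectra the (batch, n_points) simulated labels."""
    opt.zero_grad()
    loss = loss_fn(forward_net(designs.float()), spectra)
    loss.backward()
    opt.step()
    return loss.item()
```

The inverse model of panel d) is obtained by swapping the roles of the two tensors and reversing the input/output dimensions of the network.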

Figure 19 .
Figure 19. Strategies for addressing non-unique mapping in inverse engineering. a) Multiple design configurations exhibit identical electromagnetic responses. Reproduced with permission. [140] Copyright 2018, American Chemical Society; b) A method that resolves the multiple-solution issue in inverse design by partitioning the training data. Reproduced with permission. [141] Copyright 2008, IEEE; c) A dimension-reduction design methodology for electromagnetic nanostructures based on autoencoders. Top: a forward model consisting of a cascaded pseudo-encoder and decoder, employed to map the reduced design space to the response space. [142] Middle: a metasurface with configurable reflectivity; [142] Bottom: multi-layer thin-film structures. [143] Both are developed through dimensionality-reduction techniques based on the autoencoder. Copyright 2020, Springer Nature, and Copyright 2021, The Optical Society; d) A tandem neural network to solve the non-unique issue. During training, the weights in the forward model are fixed, while those in the inverse network are modified to minimize the error. After training, the design parameters are extracted from the intermediate layer; [140] e) The MDN technique for dealing with one-to-many mapping. Reproduced with permission. [146,147] Left and Middle: in the inverse engineering paradigm of the MDN, response properties are mapped to multiple mixed Gaussian distributions that contain the design information, rather than to a single set of design parameters. Right: the MDN captures all degenerate solutions through a multimodal distribution, with mode strength indicated by strip opacity. Reproduced with permission. Copyright 2020, American Chemical Society.
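The tandem scheme of panel d) reduces to a short training loop. The sketch below uses assumed dimensions and layer sizes; the essential point is that the forward model's weights are frozen while gradients still flow through it into the inverse network.

```python
import torch

n_params, n_resp = 16, 64                        # assumed dimensions
inverse_net = torch.nn.Sequential(               # response -> design
    torch.nn.Linear(n_resp, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, n_params),
)
forward_net = torch.nn.Sequential(               # stand-in for a pretrained model
    torch.nn.Linear(n_params, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, n_resp),
)
for p in forward_net.parameters():
    p.requires_grad_(False)                      # freeze the forward weights

opt = torch.optim.Adam(inverse_net.parameters(), lr=1e-3)
targets = torch.rand(32, n_resp)                 # batch of target responses
for step in range(1000):
    opt.zero_grad()
    designs = inverse_net(targets)               # intermediate layer holds the
    pred = forward_net(designs)                  # retrieved design parameters
    loss = torch.nn.functional.mse_loss(pred, targets)
    loss.backward()                              # gradients pass through the
    opt.step()                                   # frozen forward model
```

Because the loss compares responses rather than designs, any of the degenerate designs that reproduce the target response is an acceptable output, which is how the tandem network sidesteps one-to-many ambiguity.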

Figure 21 .
Figure 21.Schematic of the CNN model.

Figure 22 .
Figure 22. Nanophotonic power splitter design via a CNN. [159] a) Overview of the CNN recognition and prediction process; b) Examples of randomly generated test datasets; c) The correlation coefficient between the target and predicted transmission spectra is 0.85 for the CNN and 0.16 for the FCN. Inset: training and test data distributions. Reproduced with permission. [159] Copyright 2019, CLEO.

Figure 23 .
Figure 23. PDE solvers based on physics-augmented neural networks to accelerate nanophotonic research. a) Schematic of a PINN for solving the diffusion equation with mixed boundary conditions. Reproduced with permission. [160] Copyright 2021, Society for Industrial and Applied Mathematics; b) Methodology for lens optimization via a physics-driven neural network. Here, the initial lens design is fed into MaxwellNet, yielding the EM response; the objective function is derived via Green's function convolution and optimized through backpropagation until it converges. Reproduced with permission. [161] Copyright 2023, AIP Publishing; c) Utilization of the WaveY-Net-based optimizer for ultra-fast, high-accuracy forward prediction and inverse design.
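The idea of panel a) can be illustrated with a minimal PINN sketch for a 1D steady diffusion problem with mixed boundary conditions; the equation, source term, and network size here are illustrative assumptions rather than the configuration of ref. [160].

```python
import torch

# Solve u''(x) = f(x) on [0, 1] with u(0) = 0 (Dirichlet) and u'(1) = g (Neumann).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
f = lambda x: torch.sin(torch.pi * x)            # assumed source term
g = 0.5                                          # assumed Neumann value

for step in range(2000):
    x = torch.rand(128, 1, requires_grad=True)   # interior collocation points
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    pde_loss = ((d2u - f(x)) ** 2).mean()        # residual of u'' = f

    dir_loss = net(torch.zeros(1, 1)).pow(2).mean()   # enforce u(0) = 0

    x1 = torch.ones(1, 1, requires_grad=True)
    du1 = torch.autograd.grad(net(x1).sum(), x1, create_graph=True)[0]
    neu_loss = ((du1 - g) ** 2).mean()           # enforce u'(1) = g

    loss = pde_loss + dir_loss + neu_loss        # physics replaces labeled data
    opt.zero_grad()
    loss.backward()
    opt.step()
```

No simulated training labels appear anywhere: the PDE residual and boundary terms alone supervise the network, which is what distinguishes physics-augmented solvers from the data-driven surrogates discussed earlier.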

Figure 27 .
Figure 27. Metagrating inverse design based on GANs. [183] a) The generative network optimizes the topology, and devices with excellent performance are used for the next stage of generative model training; b) Left: histogram of the efficiency distribution of random binary patterns, geometrically stretched training-set patterns, and GAN-generated patterns. Right: efficiency distributions after 30 rounds of topology refinement for the 50 highest-efficiency devices selected from the GAN-generated devices and the stretched training set; c) Efficiency comparison of GAN-generated patterns before (left panel) and after (right panel) topology refinement for different wavelengths and reflection angles. Reproduced with permission. [183] Copyright 2019, American Chemical Society.

Figure 29 .
Figure 29. Knowledge discovery methods for nanophotonic research. a) Visualization of the reflective response space. Left: the distribution of reflective properties within the latent response space. Right: the convex latent space mapped from the feasible response. Reproduced with permission. [190] Copyright 2022, The Authors; b) Representation of the convex hulls in the reduced 2D (left) and 3D (middle) response spaces. Right: feasible non-convex response spaces identified through the SVM algorithm. Reproduced with permission. [194] Copyright 2019, Wiley; c) Strategies for characterizing a multiparameter space. Left: a sparse collection of good designs generated by optimizing in the original high-dimensional design space. Middle: a dimensionality reduction technique is employed to map good designs onto a low-dimensional subspace. Right: metric characterization in the low-dimensional space. Reproduced with permission. [192] Copyright 2019, Springer Nature; d) By learning the physical relationships between structural parameters and response characteristics, new designs with novel target attributes are constructed. Reproduced with permission. [193] Copyright 2020, American Chemical Society.

Figure 30 .
Figure 30. A comprehensive multi-attribute comparison of different design methods. Designable size: the maximum device size that the algorithm can handle; DOF: design degrees of freedom; Efficiency: the inverse of the time overhead of designing a device; Economy: the inverse of the total cost of the entire design process; Universality: the ability of the model to address new design targets. Universality is strong if the current model can be directly transferred to a new design target (as with deep learning models); conversely, it is weak if new iterations and simulations are required (as with optimization methods); Global search ability: the ability of the algorithm to escape local optima and explore the entire design space.

Figure 31 .
Figure 31. Fine-tunable and reconfigurable silicon photonic systems. a) Top: schematic diagram of the photonic emulator; middle left: PCB image of the photonic emulator; bottom left: thermal field distribution of the chip at different voltage configurations; bottom right: overall photograph of the packaged chip. Reproduced with permission. [214] Copyright 2023, American Chemical Society; b) Top left: diagram of the core component of the programmable photonic circuit; bottom left: schematic diagram of a photonic chip integrated with analog and digital control electronics; middle right: optimization algorithm and user application programming interface (API). Reproduced with permission. [213] Copyright 2020, Springer Nature.

Figure 32 .
Figure 32. Visualization of the deep metric learning approach based on the Siamese network and triplet loss. By minimizing the triplet loss during training, points from the same class are mapped to neighboring regions of the space, preserving the similarity between members of the same class in the form of intraclass distance. Here, the shape of the resonance lines, rather than the resonance wavelength, is used to differentiate response categories. [216] Reproduced with permission. [216] Copyright 2023, American Chemical Society.
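A minimal sketch of the Siamese/triplet setup follows; the input length and embedding width are assumptions, and the shared-weight branch plus `torch.nn.TripletMarginLoss` reproduce the mechanism the caption describes.

```python
import torch

embed = torch.nn.Sequential(              # shared-weight embedding branch
    torch.nn.Linear(200, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 16),              # 16-D metric space (assumed)
)
loss_fn = torch.nn.TripletMarginLoss(margin=1.0)

# anchor/positive share a line-shape class; negative comes from another class.
anchor, positive, negative = (torch.rand(8, 200) for _ in range(3))
loss = loss_fn(embed(anchor), embed(positive), embed(negative))
loss.backward()   # training shrinks intraclass and grows interclass distance
```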

Figure 33 .
Figure 33. Preparation of a dataset for a 1 × 4 power splitter. a) Device pattern; b) The dataset is uniformly distributed in the design space (shown in a reduced 2D diagram); c) Extremely uneven dataset distribution in the response space (shown in a reduced 2D diagram).

Figure 34 .
Figure 34. Inverse-designed devices on other silicon-based material platforms. a) SiN nonlinear optical switch. Reproduced with permission. [223] Copyright 2022, John Wiley and Sons; b) SiN photonic spin selector. Reproduced with permission. [224] Copyright 2021, Springer Nature; c) SiN metalenses. Reproduced with permission. [225] Copyright 2020, American Chemical Society; d) SiN grating. Reproduced with permission. [226] Copyright 2023, IEEE; e) SiC nanophotonic resonator. Reproduced with permission. [227] Copyright 2019, Springer Nature; f) SiC reflector. Reproduced with permission. [228] Copyright 2022, The Optical Society; g) SiC optical cavity. Reproduced with permission. [237] Copyright 2023, Springer Nature; h) SiGe power amplifier. Reproduced with permission. [229] Copyright 2022, IEEE.

Table 1 .
Comparison of the time consumption of different optimization methods.

Table 2 .
Design DOF and minimum feature size for different optimization methods.