GPU‐Accelerated and Memory‐Independent Layout Generation for Arbitrarily Large‐Scale Metadevices

Metadevices are of paramount interest and importance due to their exotic capabilities for light control and manipulation in the nanoscale regime. Various practical applications, including wearable devices, lasers, optical sensors, high-resolution microscopy, virtual and augmented reality, and metasails, require large-scale metasurfaces. However, layout generation for large metasurfaces faces challenges, as up to billions of unit cells are required to create a metasurface, resulting in slow speeds and high memory demands. Here, a new framework, ParallelGDS, is proposed for fully parallel, ultra-fast, and memory-independent generation of graphic design system (GDSII) files for arbitrarily large metasurfaces. Compared to existing methods such as GDSTk, GDSPy, and LUMERICAL's polystencil, the proposed framework accelerates the layout generation process by factors of 10 to 100. More importantly, ParallelGDS drastically reduces the required memory (an O(n²) problem) with an average reduction factor of 0.5 × D_n², where D_n is the normalized metasurface diameter: the framework uses only ≈2 GB of memory regardless of the size of the metasurface. The proposed framework offers complete control over the memory requirements and parallelization level of layout generation, ranging from single-core CPU usage, to multithreaded CPU utilization, and finally to full utilization of all GPU cores.

Typically, metasurfaces span a footprint with diameters on the order of 50-500 μm. However, numerous practical applications require devices of millimeter to centimeter sizes. These applications include medical devices, high-resolution microscopy, wearable devices, [42][43][44] virtual and augmented reality (VR and AR), [45] single-atom trapping, [46] large field of view (FOV) transmission-type eyepieces, [47] fingerprint imaging, [48] metasurface-based optical concentrators, [49] and metasails. [50] As the aperture size increases, metalenses admit a larger FOV and can achieve higher numerical apertures (NA), significantly enhancing imaging quality in low-light conditions. [51] While providing many benefits, upscaling a metasurface footprint comes with a cost. Due to their nanoscale and periodic nature, metasurfaces typically contain millions to billions of meta-atoms, and this number scales quadratically with diameter. For instance, a metasurface with a 1 cm diameter and 310 nm unit cell periodicity contains roughly one billion meta-atoms. As a result, associating just one double-precision floating-point number with each shape already requires several GB of memory. These quickly growing memory constraints make it very challenging to simulate and generate a layout for such extremely large-scale metasurfaces. Beyond that, even if one successfully executes the simulation using workaround techniques such as stitching, the high amount of aggregate required memory plagues the generation of the layout file needed to fabricate the simulated device. Layout file generation is a nontrivial aspect of metasurface design and is affected in three fundamental ways by different constraints. First, the total size of the final layout file can easily occupy hundreds of GBs to TBs. This problem has been addressed by converting the final GDSII file to newer formats like the open artwork system interchange standard (OASIS) [52] or METAsurface compression (METAC). [44]
The second issue is the memory bottleneck: even on computers with hundreds of GBs of memory, it is challenging to generate the layout file for large-scale metasurfaces due to the extraordinarily high amount of memory required by the generation process. For instance, it is not possible to generate the layout file for a 3 mm metasurface with 300 nm unit cell periodicity with the well-known GDSPy library [53] on a PC with 128 GB of RAM, as this process requires roughly 135 GB of RAM. The third problem arises because, even on powerful PCs, layout generation is sequential, making the process incredibly slow.
Therefore, addressing the challenges of large-scale metasurface layout generation necessitates the development of an ultra-fast, parallel, and memory-independent method. To this end, parallel computing has revolutionized numerous research fields, making once-intractable computations possible and reducing running times from days or weeks to just a few minutes. This offers a promising remedy for the above issues in layout generation. The accessibility of hardware capable of parallel computing and the utilization of general-purpose graphics processing units (GPGPUs) have led to significant advancements in areas such as machine learning and artificial intelligence. [56] We will show that adapting parallel computing techniques to metasurface layout generation overcomes the aforementioned problems, making it possible to handle the complexity and generate layouts for metasurfaces with millions to billions of meta-atoms.
Here, we propose ParallelGDS, a new framework for the fully parallel, ultra-fast, and memory-independent generation of GDSII files for arbitrarily large-scale metasurfaces. Our framework uses a fixed amount of memory regardless of the size of the metasurface: only ≈2 GB of memory is required for the layout generation of metasurfaces of any size. The performance of the proposed framework has been compared against well-known frameworks (GDSPy, [53] GDSTk, [57] and GDS export using LUMERICAL's "polystencil" [58]) for gradient and geometrical metasurfaces of different sizes. Our framework yields up to a 100-fold increase in generation speed. Additionally, it considerably reduces the required memory relative to the other frameworks, by an average factor of approximately 0.5 × D_n², where D_n is the normalized metasurface diameter. The framework also provides arbitrary adjustment of the required memory and parallelization level, from a single CPU core, to multithreaded CPU utilization, and finally to full utilization of all GPU cores for fully parallel layout generation. This enables users to generate layout files even on low-end computers with almost any hardware configuration. The proposed framework paves the way for layout generation in current and future applications requiring thin, lightweight, and very large-scale metadevices.

Overview of the Proposed Framework
GDSII is a database format for representing planar geometrical shapes and is one of the gold standards in industry for transferring the layouts of integrated circuits. The database consists of hierarchically organized records and uses a binary format for compactness. Coordinates in this format are defined as 4-byte signed integers and stored in big-endian byte order (the most significant byte of a multi-byte data word is stored at the lowest memory address).
Conventionally, the metasurface layout and the corresponding GDSII files are generated using existing layout-creation libraries (e.g., GDSPy, GDSTk, LUMERICAL's polystencil) in two steps. In the first step, the geometric shape of each meta-atom is created, one at a time, at the desired position and orientation (i.e., the coordinates of the points forming each shape are calculated). The shapes are kept in memory until the entire layout is generated. In the second step, the shapes are converted to (and wrapped in) the GDSII format and saved to disk. Although this approach has shown promising results, it exhibits two major drawbacks when generating layouts for large-scale metasurfaces. The first drawback is the sequential nature and running time of this approach. For instance, to create a 3 mm gradient/geometrical metasurface layout with a period of 300 nm, approximately one billion meta-atoms must be created. In each time step, the shape of a single meta-atom is created, and this task must be repeated one billion times for shape generation and one billion times for conversion to the GDSII format. The running time of this method is O(n²), as there is a quadratic relation between the number of meta-atoms and the metasurface diameter. Therefore, this approach becomes significantly slower as the size of the metasurface increases. It should be noted that converting the generated shapes to the GDSII format is done sequentially as well.
Coordinates in the GDSII format are defined as 4-byte signed integers and stored in big-endian byte order. When working with multi-byte data types, such as integers and floating-point numbers, on which bitwise operations must be performed on individual bytes, storing data in big-endian format adds an additional conversion step, since the CPU operates on data in its native little-endian format. The generated coordinates are scaled, rounded to the nearest integer, and afterward converted to big-endian format, which makes the whole process slower and less efficient. Finally, a GDSII record is created from the converted coordinates by injecting a record header and footer before and after the coordinate bytes, and the created record is stored on disk. This process involves a large number of write-to-disk calls, which further slows down the code.
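The little-to-big-endian conversion described above can be vectorized rather than performed one coordinate at a time. A minimal sketch using NumPy (variable names are illustrative, not from the framework):

```python
import struct

import numpy as np

# GDSII stores each coordinate as a 4-byte signed big-endian integer.
# Per-coordinate packing (what a sequential generator effectively does):
scalar_bytes = struct.pack(">i", 12345)  # one big-endian int32

# Vectorized alternative: round, cast to int32, and reinterpret the whole
# array as big-endian in one shot instead of looping over coordinates.
coords = np.array([12345.4, -6789.6, 0.0])
big_endian = np.round(coords).astype(np.int32).astype(">i4")
```

The first four bytes of `big_endian.tobytes()` match the scalar `struct.pack` result, but the array version converts the entire coordinate set in a single operation.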
The second major drawback is the lack of memory management, which makes this method inefficient when dealing with large-scale layouts. During generation, the geometric shapes of the meta-atoms occupy memory as they are created, and they are kept in memory until the entire layout is complete. Only then are the shapes converted to the desired format and stored on disk, resulting in significant memory overhead and processing time. This becomes increasingly challenging as the number of meta-atoms grows from millions to billions, requiring a substantial amount of memory to generate the layout.
The proposed framework, ParallelGDS, aims to facilitate the generation of large-scale metadevices by alleviating memory constraints and minimizing time-consuming processes. The framework has two primary characteristics: first, it optimizes memory efficiency and eliminates hardware restrictions to the greatest extent possible; second, it significantly reduces generation time by harnessing the parallelism of modern computing devices. Memory efficiency is achieved by following an approach analogous to the divide-and-conquer algorithm: instead of generating and storing the entire layout in memory, then converting and saving it, the target layout is generated, converted, and saved in batches of smaller sub-layouts (as illustrated in Figure 1). Thus, only a fixed amount of memory is ever required, regardless of the size of the metasurface.
To reduce the generation time, both the generation and GDSII conversion processes are implemented to run on parallel computing devices (using single instruction, multiple data (SIMD) operations, multi-core and multi-threaded CPUs, or GPUs), resulting in a significant speedup. An in-depth description of our method is given in Section 4.
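The batching idea above can be sketched as follows; the function names are hypothetical stand-ins for the framework's actual pipeline:

```python
import numpy as np

def generate_block(block):
    """Stand-in for parallel shape generation; returns vertex coordinates."""
    return block  # (n_shapes, 2 * n_vertices) float array

def to_gds_bytes(shapes):
    """Stand-in for GDSII conversion: big-endian int32 coordinate bytes."""
    return np.round(shapes).astype(np.int32).astype(">i4").tobytes()

def write_layout(path, blocks):
    """Generate, convert, and save one block at a time, so peak memory
    is bounded by the block size, not by the total layout size."""
    with open(path, "wb") as f:
        for block in blocks:
            f.write(to_gds_bytes(generate_block(block)))
```

Because each block's byte stream is flushed to disk before the next block is produced, memory usage stays constant no matter how many blocks the layout contains.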

Metasurface Design
To quantitatively and qualitatively assess the performance of the proposed framework, we have designed two different metasurfaces for a central wavelength of λ_0 = 650 nm: a metalens based on the gradient phase and a zeroth-order Bessel beam generator based on the Pancharatnam-Berry (P.B.) phase. [59,60] Numerical simulations are performed using the finite-difference time-domain (FDTD) module of ANSYS LUMERICAL, both for the unit cell and for full-wave simulations of the whole metasurface. The gradient metalens is 30 μm × 30 μm and designed to create a focal point at z = 15 μm (to mitigate the extremely long simulation times and very large memory requirements, we simulate a 30 μm × 30 μm metalens but keep the NA of the larger metalenses the same as that of the designed 30 μm × 30 μm metalens). The required phase on the metalens, φ(x, y), can be obtained using the following equation:

φ(x, y) = (2π/λ_d) (f − √(x² + y² + f²))

where x and y are points in Cartesian coordinates, λ_d is the working wavelength, and f is the focal length of the lens. The zeroth-order Bessel beam generator metasurface with NA = 0.2 is designed based on the geometrical phase, using the analytical equation for the transmission of a rotated unit cell:

T(θ) = R(−θ) · diag(t_o, t_e) · R(θ)

where R(θ) is the 2 × 2 rotation matrix, t_o and t_e represent the complex transmission coefficients when the polarization of the incident light is aligned along the principal axes of the meta-atom, and θ is the rotation angle. [60] Considering circularly polarized incident light, the transmitted electric field can be mathematically described as: [60]

E_t = ((t_o + t_e)/2) (ê_x + i ê_y)/√2 + ((t_o − t_e)/2) e^(i2θ) (ê_x − i ê_y)/√2

where ê_x and ê_y are the electric field unit vectors along the x- and y-directions. The required phase on the metasurface, φ(x, y), can be obtained using:

φ(x, y) = −(2π/λ_d) NA √(x² + y²) + n arctan(y/x)

where x and y are points in Cartesian coordinates, λ_d is the working wavelength, and n is the order of the Bessel beam (n = 0 for our design). Consequently, the rotation of each meta-atom is obtained as θ = φ(x, y)/2. Since each unit cell is a Pancharatnam-Berry optical element (PBOE), the PBOEs must act as a half-waveplate and transform the incident circularly polarized light into the opposite handedness.
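The two phase profiles and the P.B. rotation rule above can be evaluated numerically as follows (a sketch with NumPy; the grid and parameter values are illustrative):

```python
import numpy as np

LAMBDA_D = 650e-9  # working wavelength (m)

def lens_phase(x, y, f):
    """Hyperbolic metalens phase phi(x, y) for focal length f."""
    return (2 * np.pi / LAMBDA_D) * (f - np.sqrt(x**2 + y**2 + f**2))

def bessel_phase(x, y, na=0.2, n=0):
    """Phase of an nth-order Bessel beam generator (axicon-like profile)."""
    return (-(2 * np.pi / LAMBDA_D) * na * np.sqrt(x**2 + y**2)
            + n * np.arctan2(y, x))

# Sample the profiles on a 30 um x 30 um grid with 420 nm periodicity.
coords = np.arange(-15e-6, 15e-6, 420e-9)
X, Y = np.meshgrid(coords, coords)
theta = bessel_phase(X, Y) / 2  # P.B. rule: meta-atom rotation = phi / 2
```

At the center of the lens the phase is zero by construction, and the resulting `theta` array directly supplies the per-meta-atom rotation angles consumed by the layout generator.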

Proposed Framework (ParallelGDS)
The proposed method tackles the issues of high memory requirements and slow generation times in the following manner. First, to address the excessive memory usage, the large-scale layout is broken down into smaller sub-layouts, and only a fraction of the design is generated in each iteration. Second, to speed up the process, the geometrical shape generation and conversion tasks are reformulated to run on a parallel computing device (the framework can run on multi-core, multi-threaded CPUs or on GPUs), allowing parallel execution of the entire process. PyTorch is used to implement the main components of the framework due to its excellent parallelism capabilities, either on the GPU or through heavy utilization of SIMD operations on the CPU for hardware-level data parallelism, and due to its optimized parallel implementations of various tasks, including matrix multiplications, vector operations, and convolutions. Additionally, designers can effortlessly define the functions used in layout generation with PyTorch. The proposed method (illustrated in Figure 3) is explained in what follows. The following notation is used throughout the rest of the paper: bold lowercase letters denote vectors, bold uppercase letters denote tensors (matrices are also referred to as second-rank tensors in this text), and plain italic letters denote scalar quantities. Subscripts indicate specific variables, and superscripts indicate scalar elements within vectors and matrices within tensors. The entire process of creating the layout is divided into two parts: geometrical shape creation and GDSII format conversion. Given a target layout, the following steps are performed to create the geometrical shapes.
In the first step, the target layout is divided into an array of blocks, each representing a portion of the original layout. The block size determines how many meta-atoms are generated in parallel; it is fixed and defined by the designer. By dividing the layout into blocks, only a fraction of the layout is generated and stored on disk at a time, which requires far less memory than the sequential methods. The geometrical shapes in each block are then created and converted to the desired format in parallel in the subsequent steps. In the second step, the coordinates of the shapes are extracted from each block in the form of two vectors t_x, t_y ∈ ℝ^n, where n is the number of shapes in the block and ℝ denotes the set of real numbers. In the third step, given the coordinate vectors t_x and t_y, the transformational properties of the shapes are calculated using the phase profile function. This function produces the vectors θ, s_x, s_y ∈ ℝ^n, where θ is the rotation vector and s_x and s_y are the scaling vectors in each dimension.
In the fourth step, a 3D tensor T ∈ ℝ^(3×3×n) is created by stacking the per-shape transformation matrices T^i, where the superscript 0 ≤ i < n denotes the index within the vectors and tensors. A single transformation matrix is defined as the composition of the scale, rotation, and translation matrices:

T^i = | s_x^i cos θ^i   −s_y^i sin θ^i   t_x^i |
      | s_x^i sin θ^i    s_y^i cos θ^i   t_y^i |
      | 0                0               1     |

The geometrical shape of the unit cell is also defined as a tensor U ∈ ℝ^(3×m), where the first and second rows are the point coordinates, the third row is a vector of ones enabling the translation operation in the transformation matrix, and m is the number of vertices used to create the shape. The unit cell shape can be created with the framework's methods or loaded manually from a file (see Section S3, Supporting Information, for a list of exemplary unit cells that can be generated using our framework). The goal of this step is to map the shape to the desired position, orientation, and form. It should be noted that T can also model shearing, reflection, and, in general, any mapping that can be expressed as a 3 × 3 matrix. The transformation tensor is then multiplied by the unit cell tensor, resulting in a tensor of generated shapes S ∈ ℝ^(3×n×m). The broadcasting feature of PyTorch handles the multiplication of tensors with different dimensions; it eliminates the need for duplicating the unit cell tensor, making the entire process faster and saving memory. The tensor S (illustrated in step 5 of Figure 3) contains the generated shapes within the given block; each 2D slice of S represents a single shape.
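Steps 4 and 5 can be sketched as follows. NumPy is used here for portability; its matmul broadcasting behaves like PyTorch's, applying every 3 × 3 transform to the shared unit cell without duplicating it. All parameter values are illustrative:

```python
import numpy as np

n, m = 4, 6  # shapes per block, vertices per shape

# Per-shape parameters as produced by a phase profile function (illustrative).
theta = np.linspace(0, np.pi, n)
sx = sy = np.full(n, 0.5)
tx, ty = np.arange(n) * 1.0, np.zeros(n)

# Stack n homogeneous transforms: translation * rotation * scale.
T = np.zeros((n, 3, 3))
T[:, 0, 0], T[:, 0, 1] = sx * np.cos(theta), -sy * np.sin(theta)
T[:, 1, 0], T[:, 1, 1] = sx * np.sin(theta), sy * np.cos(theta)
T[:, 0, 2], T[:, 1, 2], T[:, 2, 2] = tx, ty, 1.0

# Unit cell in homogeneous coordinates: two rows of vertices, one of ones.
phi = np.linspace(0, 2 * np.pi, m, endpoint=False)
U = np.stack([np.cos(phi), np.sin(phi), np.ones(m)])  # (3, m)

# Broadcasting: (n, 3, 3) @ (3, m) -> (n, 3, m), one matmul for the block.
S = T @ U
```

A single batched matrix multiplication produces every transformed shape in the block; no loop over meta-atoms is needed.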
In the following steps, the generated shapes are converted to the GDSII format. The augmented dimension of S (i.e., the row of ones) is discarded, and the following operations are performed on the coordinates of S (step 6 in Figure 3). The numbers are rounded to the nearest integer, and their data type is changed from float to integer. Additionally, because of the previously discussed mandatory conversion between little-endian and big-endian representations, the byte order is converted from little-endian to big-endian. Since the framework provides a vectorized implementation of this conversion, this task also runs in parallel. The result is a new tensor S_c ∈ ℤ^(2×n×m), where ℤ denotes the set of integers. Next, S_c is interleaved such that the coordinates of each shape are stored consecutively in a single vector; the resulting tensor is S_i ∈ ℤ^(n×2m) (step 7 in Figure 3).
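Steps 6 and 7 amount to a handful of vectorized array operations. A NumPy sketch (the tensor layout follows the text; the sample data is illustrative):

```python
import numpy as np

n, m = 3, 4
S = np.random.uniform(-1e4, 1e4, size=(3, n, m))  # shapes with the ones row

# Step 6: drop the augmented row, round, cast to integer, and convert
# the byte order from native little-endian to GDSII's big-endian.
Sc = np.round(S[:2]).astype(np.int32).astype(">i4")  # (2, n, m)

# Step 7: interleave x and y so each shape becomes x0 y0 x1 y1 ...
Si = np.stack([Sc[0], Sc[1]], axis=-1).reshape(n, 2 * m)  # (n, 2m)
```

Each row of `Si` now holds the complete interleaved coordinate list of one shape, ready to be wrapped in a GDSII record.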
The repetitive process of creating the GDSII records is parallelized by treating each record's header and footer as numerical tensors. A record's header and footer are fixed-size byte arrays containing information about the record, including its type, length, and data type. These byte arrays are converted into vectors of 4-byte signed integers (compatible with the data type of the coordinates). They are then repeated and attached to the interleaved tensor to create all of the records and their shapes simultaneously. This process is illustrated in step 8 of Figure 3. The outcome is a tensor of records R ∈ ℤ^(n×(2m+r)) (step 9 in Figure 3), where r is the combined length of the attached header and footer. In the final step, the tensor of records R is rearranged into a single vector and converted to the byte data type, representing the data stream of the generated block. The resulting data stream is written to disk. This process is repeated until all of the blocks have been generated.
It is worth mentioning that GDSII inherently offers a built-in feature called referencing, which enables designers to define and reuse complicated structures or cells throughout the design hierarchy, thereby lowering the file size and accelerating the layout generation process (see Section S1, Supporting Information, for a detailed description of the reference capability and a comparison of its performance in our framework against other frameworks).
It should be noted that our framework also offers optional on-the-fly compression of the output, with at least a 60% reduction in file size.
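As an illustration of how block-wise on-the-fly compression can work, here is a generic sketch using gzip from the Python standard library (the actual codec used by the framework is not specified here):

```python
import gzip
import io

buf = io.BytesIO()
# Wrap the output stream in a gzip writer: each block's byte stream is
# compressed as it is written, so no uncompressed copy ever accumulates.
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    for _ in range(10):
        gz.write(b"\x00" * 4096)  # stand-in for one block's GDSII records
compressed = buf.getvalue()
```

Because compression happens inside the per-block write path, it adds no extra memory footprint beyond the block itself.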

Results
The presented framework (ParallelGDS) has been employed to produce layouts for the gradient and geometrical metasurfaces of varying dimensions designed in Section 3. Its performance is assessed against prevalent GDSII generation techniques, namely GDSTk, GDSPy, and LUMERICAL's polystencil. All results were obtained on an identical computer system equipped with 128 GB of memory. As shown in Figure 4, the proposed framework generates metasurfaces of arbitrary sizes without encountering any memory constraints. In contrast, the alternative methods (GDSTk, GDSPy, and LUMERICAL's polystencil) fail to produce layouts for metasurfaces whose memory demands exceed the system's capacity. Consequently, data points for these techniques have been extrapolated in Figure 4 for metasurface dimensions surpassing our local computer's 128 GB memory limit. In terms of required memory, Figure 4a,c shows that the proposed framework uses a fixed amount of memory (≈2 GB) regardless of the size of the metasurface, which enables the generation of metasurfaces of any arbitrarily large size. In contrast, as is evident in Figure 4a,c, the required memory increases rapidly for the other methods. A simple polynomial fit reveals that our method reduces the required memory relative to the other methods by an average factor of 0.5 × D_n², where D_n is the normalized metasurface diameter, a remarkable result given that the problem is O(n²) (e.g., an n-fold increase in the diameter of the metasurface results in an n²-fold increase in the number of meta-atoms).
As illustrated in Figure 4b,d, the layout generation time of the proposed framework is markedly superior to that of the alternative methods: our approach yields up to two orders of magnitude reduction in generation time for both gradient and geometrical metasurfaces. Using the proposed framework, layout files can be generated for metasurfaces of any size. For instance, we have successfully generated the layout file for a metasurface with a diameter of 5 cm and a unit cell periodicity of 420 nm.

Investigating the Effect of Number of Vertices
To examine the influence of the vertex count of each shape on the generation time and memory requirements of the final layout, we varied the number of vertices constituting each shape from 10 to 100 in increments of ten. Ten distinct layouts were generated for a 3 mm × 3 mm geometrical metasurface with a block size of 128 × 128. As depicted in Figure 5a,b, increasing the number of vertices does not affect the memory utilization of ParallelGDS, whereas its counterparts experience a rapid increase in the amount of required memory. As expected, increasing the number of vertices increases the generation time for all methods; however, the generation time of ParallelGDS remains far below that of the other two frameworks.

Block Size Experiment
Although the optimal block size depends on various factors, including the characteristics of the parallel devices and the nature of the computational tasks at hand, determining it is a critical aspect of achieving maximum performance in any parallel computing system, as it directly impacts the processing capabilities of the parallel devices. To investigate the effect of block size on layout generation time, we consider a 3 mm × 3 mm P.B. metasurface in which each shape comprises 64 vertices, and vary the block size from 8 to 1024 in steps of powers of 2. It should be noted that electron-beam lithography (EBL) involves a pre-processing step called fracturing, in which each shape is decomposed (fractured) into primitive shapes (rectangles and trapezoids for most machines). However, since EBL machines raster-scan the whole area along the horizontal and vertical directions, every shape (even a primitive) must be fractured into primitives with at least two edges parallel to either the horizontal or the vertical direction. Because of their round edges, ellipses could be fractured an arbitrarily large number of times. We have therefore implemented a pre-fracturing process for ellipses, in which each ellipse is fractured into an odd number of rectangles with their axes aligned with the rotated axes of the ellipse (see Section S2, Supporting Information, for a detailed description of the proposed pre-fracturing algorithm, which leads to consistent ellipses after fabrication). As depicted in Figure 5c, increasing the block size results in a significant reduction in generation time up to a critical point. This is due to the enhanced exploitation of parallelism, which allows multiple tasks to be executed simultaneously, thereby increasing the system's overall efficiency. However, after surpassing this critical point, which corresponds to the global minimum in Figure 5c, the generation time rises again. This phenomenon can be explained by the saturation of the device's parallelism capability: increasing the block size beyond the critical point no longer adds parallelism to the pipeline. As resources become increasingly strained, the parallel devices cannot sustain further performance improvements, ultimately leading to diminished efficiency. For the current design, our analysis shows that the optimal block size is 128 × 128 (i.e., 16,384 meta-atoms generated simultaneously), which results in the fastest processing times, thereby maximizing the benefits of the parallel computing system.
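A simplified sketch of such an ellipse pre-fracturing step follows (the actual algorithm is detailed in Section S2, Supporting Information; this midpoint-based slicing is only an illustrative approximation, and rotation of the fractured axes is omitted):

```python
import numpy as np

def prefracture_ellipse(a, b, k=5):
    """Approximate an ellipse (semi-axes a, b) with an odd number k of
    axis-aligned rectangles, sized from the ellipse height at each
    slice midpoint."""
    assert k % 2 == 1, "an odd count keeps one slice centered on the axis"
    edges = np.linspace(-a, a, k + 1)          # vertical slice boundaries
    rects = []
    for x0, x1 in zip(edges[:-1], edges[1:]):
        xm = 0.5 * (x0 + x1)                   # slice midpoint
        h = b * np.sqrt(max(0.0, 1.0 - (xm / a) ** 2))
        rects.append((x0, -h, x1, h))          # (xmin, ymin, xmax, ymax)
    return rects
```

The odd rectangle count guarantees a central slice spanning the ellipse's full minor axis, which helps keep the fabricated shape symmetric about both axes.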

Performance Evaluation on Low-End Computers
In the last experiment of this study, and to demonstrate the excellent adaptability of the proposed framework, we compare the performance of our method against GDSPy and GDSTk on a low-end computer with 8 GB of memory and an Intel Core i7-6500U CPU (two cores, four threads, maximum boost frequency of 3.1 GHz). The test case is a P.B. metasurface with 64 vertices per shape. As can be seen in Figure 6, our framework easily generates metasurfaces of arbitrary size (the only upper limit on the size of the metasurface is the amount of available hard disk space on the machine). In contrast, GDSPy and GDSTk can only generate metasurfaces up to 1 mm.
It is worth highlighting the minor difference in memory consumption between the current investigation, shown in Figure 6 (≈1.5 GB), and the previous experiment presented in Figure 4 (≈2 GB). The reduction in memory utilization is attributed to the use of a low-end machine in the current experiment, in which the GPU is not employed; all computations are executed by the CPU. As a result, the GPU-related modules, which would typically be initialized at the beginning of the layout generation process when the GPU is used, are unnecessary, and their elimination reduces the amount of memory required for the experiment.

Conclusion
In conclusion, the proposed framework (ParallelGDS) demonstrates a significant advancement in generating GDSII files for large-scale metasurfaces, addressing the critical challenges of slow generation speeds and large memory requirements. Through extensive comparison with existing methods such as GDSTk, GDSPy, and LUMERICAL's polystencil, ParallelGDS showcases remarkable improvements in both speed and memory: at least a 10-fold and up to a 100-fold increase in layout generation speed, and a reduction in memory requirements by a factor of ≈0.5 × D_n², where D_n is the normalized metasurface diameter. Unlike conventional methods that are limited by available memory, our framework requires only a relatively small amount of RAM (≈2 GB) to generate metadevices of arbitrarily large size; the maximum device size depends solely on the available storage space on the machine's hard disk. Furthermore, the framework's adaptability in memory usage and parallelization level allows it to cater to a wide range of computational resources, from single-core CPU usage, through multi-core and multi-threaded CPU utilization, up to full utilization of all GPU cores, making it accessible even on low-end personal computers. ParallelGDS is set to profoundly impact a vast variety of applications requiring thin, lightweight, and very large-scale metasurfaces, marking a pivotal turning point in the advancement of metadevice technology and paving the way for the realization of numerous state-of-the-art applications with previously unachievable levels of efficiency and accessibility.

Figure 1 .
Figure 1. Brief overview of ParallelGDS. The target layout (the layout to be generated) is partitioned into an array of blocks, and the blocks are fed to the parallel computing device, where all the shapes in each block are generated in parallel. The blocks are saved to disk after generation.
The optimum values of the geometrical parameters of the unit cells (nanoposts) are found by sweeping different geometrical parameters using the FDTD module of ANSYS LUMERICAL, to achieve maximum transmission and the required phase for different radii while keeping the height of the nanoposts fixed. Each unit cell consists of a cylindrical TiO2 nanopost (H = 600 nm, P = 420 nm) on a SiO2 substrate. The unit cell, the full structure of the lens, a perspective view of a zoomed-in region of the metalens, the phase, and the electric field intensity profiles in the xz- and xy-planes are shown in Figure 2a-f.

Figure 2 .
Figure 2. Metasurface design. a) Unit cell of the gradient metalens, b) a 3D-rendered full structure of the gradient metalens, c) a perspective view of a zoomed-in region of the gradient metalens, d) the phase of the gradient metalens, e) the electric field intensity profile in the xz-plane of the gradient metalens, and f) the electric field intensity profile in the xy-plane of the gradient metalens. g) Unit cell of the geometrical phase zeroth-order Bessel beam generator, h) a 3D-rendered full structure of the generator, i) a perspective view of a zoomed-in region of the generator, j) the phase of the generator, k) the electric field intensity profile in the xz-plane of the generator, and l) the electric field intensity profile in the xy-plane of the generator.

Figure 3 .
Figure 3. Flow chart of the layout generation using ParallelGDS. The layout is generated in two stages: first, the geometrical shapes in each block are generated (light green block), and then the shapes are converted to GDSII format (light blue block). The entire process is performed on the parallel computing device. Concretely, the following steps are performed to generate the layout: 1) the target layout is divided into blocks, 2) the locations of the shapes are extracted in the form of two vectors from each block, 3) the vectors of transformational properties are calculated using the phase profile function, 4) the transformation tensor (T) and the unit cell tensor (U) are created and multiplied by each other, 5) all of the shapes in the block are generated (tensor S) from the multiplication of step 4, 6) the coordinates are converted from float to integer, and the byte order is converted from little-endian to big-endian (resulting in tensor S_c), 7) the tensor is interleaved (resulting in tensor S_i), 8) the header and footer of the record are attached to the resulting tensor, 9) the tensor of records (R) is created, 10) the tensor of records is converted to a stream of bytes and saved to disk.

Figure 4 .
Figure 4. a,b) Comparison of the required memory and generation time of ParallelGDS against GDSPy, GDSTk, and LUMERICAL's polystencil for the gradient metasurface. c,d) The same comparison as in (a,b), but for the geometrical metasurface. In all subfigures, dashed lines indicate extrapolated values.

Figure 5 .
Figure 5. a,b) Comparison of the required memory and generation time of ParallelGDS against GDSPy and GDSTk while varying the number of vertices. c) The block-size experiment, in which the optimal block size is determined by generating the same structure while varying the block size.

Figure 6 .
Figure 6. Performance evaluation of the proposed method (ParallelGDS) against GDSPy and GDSTk on a low-end computer with 8 GB of memory and an Intel Core i7-6500U CPU (two cores, four threads, maximum boost frequency of 3.1 GHz). a) The required memory and b) the generation time.