High-throughput discovery of novel cubic crystal materials using deep generative neural networks

High-throughput screening has become one of the major strategies for the discovery of novel functional materials. However, its effectiveness is severely limited by the lack of quantity and diversity of known materials deposited in the current materials repositories such as ICSD and OQMD. Recent progress in machine learning and especially deep learning have enabled a generative strategy that learns implicit chemical rules for creating chemically valid hypothetical materials with new compositions and structures. However, current materials generative models have difficulty in generating structurally diverse, chemically valid, and stable materials. Here we propose CubicGAN, a generative adversarial network (GAN) based deep neural network model for large scale generation of novel cubic crystal structures. When trained on 375,749 ternary crystal materials from the OQMD database, we show that our model is able to not only rediscover most of the currently known cubic materials but also generate hypothetical materials of new structure prototypes. A total of 506 such new materials (all of them are either ternary or quarternary) have been verified by DFT based phonon dispersion stability check, several of which have been found to potentially have exceptional functional properties. Considering the importance of cubic materials in wide applications such as solar cells and lithium batteries, our GAN model provides a promising approach to significantly expand the current repository of materials, enabling the discovery of new functional materials via screening. The new crystal structures finally verified by DFT are freely accessible at our Carolina Materials Database http://www.carolinamatdb.org.


Introduction
Data-driven accelerated design of new materials is emerging as one of the most promising approaches for addressing the challenges in finding next-generation materials. Currently, one of the main strategies for materials discovery is screening existing materials databases [1,2,3,4]. However, such approaches are severely limited by the scale and diversity of the existing structures in the repositories, such as ICSD and Materials Project (MP), which have about arXiv:2102.01880v2 [cond-mat.mtrl-sci] 5 Feb 2021 165,000 and 125,000 materials, respectively, compared to the almost infinite chemical design space. For example, lithium compounds are widely used in electric vehicles and mobile phone batteries, but there are only 16,000 different lithium compounds in the MP database, which has been almost exhaustively screened for better lithium-ion battery [5,6]. Large-scale generation of stable hypothetical crystal structures is strongly needed to significantly expand the current materials repositories in both the quantity and compositional and structural diversity to increase the success rate of high-throughput screening of novel functional materials.
The properties of materials are closely linked to their crystal structures. Traditionally, materials scientists discover new materials by either trial-and-error or heuristic random-guess approaches, both of which are notoriously labor-intensive. One example database is Inorganic Crystal Structure Database (ICSD) [7], which collects almost all discovered materials since 1913. To date, only around 165,000 experimental structures are reported in ICSD. Considering the number of elements in the periodic table and their possible combinations, the design space of materials would be infinite combinatorially. Hence, better approaches for new materials discovery is needed.
Several working directions are investigated for the generation of new materials [8,9,10,11,12,13,14,15,16]. There are mainly three different ways to generate or discover new crystal structures including doping/element substitution [17,5,18,19], composition generation plus crystal structure prediction [9], and generative machine learning models [12,20,14,21,15,16]. The element substitution approach is the most widely used strategy. But it is subject to the extremely limited known prototype structures in the database compared to the vast chemical design space. The second approach can exploit the recently developed generative models [12] to generate a large number of hypothetical materials compositions and then use crystal structure prediction codes to predict their structures. Many global optimization methods have been developed to search the appropriate compositions and structures, including simulated annealing [22], basin hopping [23], minima hopping [24], genetic and evolutionary algorithms [10,11]. Those approaches generally guide the searches towards the local minima of free energy to identify the stable or meta-stable structures either by initial configuration space or chemical composition. However, these crystal structure prediction algorithms are usually too computationally expensive due to their reliance on DFT-based formation energy calculation and can thus only handle relatively simple structures. For complex structures, most of the time, these methods fail to find the ground truth structures corresponding to the global minimum formation energy.
One of the most promising approaches for new materials structure creation is deep generative machine learning models [12,14,15,16,25,26,13,27]. Both variational autoencoder (VAE) [15,14,26,27] and generative adversarial networks (GAN) [12,13,16,25] have been adapted for inverse design of inorganic materials with different crystal structure representations. A VAE model contains two parts: an encoder and a decoder [28,29,30]. The encoder part encodes the crystal structure distribution into a latent space, and the decoder reconstructs the material structures from the latent space. After training, new material structures can be generated by sampling in the latent space. Conversely, a GAN generator model consists of two neural networks: a discriminator (critic) and a generator, both of which are trained simultaneously. The discriminator is trained to differentiate real materials from fake ones generated by the generator, while the generator tries to generate fake materials as real as possible to fool the discriminator. The nash equilibrium achieved by the discriminator and the generator helps a GAN learn the distribution the materials implicitly. In the past few years, several inorganic materials generative models have been proposed. Those works are limited by their chemical family (e.g. special oxides) [15,16,13] or formulas generation [12] or hydrides [25]. Noh et al. [15] present a framework for learning a continuous vector for vanadium oxides using VAE, which is trained on a 3D image-like representation to attain the continuous materials space. Two sampling strategies are used in the latent space to generate only V x O y materials. Training the VAE model using 3D grid representation is computationally demanding and memory-hungry. In [16], Kim and Noh et al. trained a composition family-specific GAN model on the Mg-Mn-O system using the atom coordinates as the representation of materials. The crystal GAN model is composed of three modules: a generator, a critic (discriminator), and a classifier. The critic calculates the Wasserstein distance between real and fake materials [31]. The classifier module ensures that the generator generates desired composition and atom numbers in the unit cell. However, this model can only be used to generate structures of the Mg-MN-O system, and the model quality is limited by the small dataset since there are only limited known compounds of this chemical system. CrystalGAN [25], proposed by Nouira et al., consists of a cross-domain GAN model, which maps one hydride system into another using CycleGAN schema [32]. All these works focus on generating materials of a special material system. In a most recent work, Ren et al. [14] proposed a new VAE model that directly uses the atom coordinates and unit cell lattice parameters to encode the structures. To constrain the neural network model behavior, their invertible representation encodes the crystallographic information into the descriptors in both real space and reciprocal Fourier space crystal properties. Their model is trained with 24,785 unique ternary materials and can generate interesting new structures. However, most of their new structures are generated by perturbing the latent vectors of known materials. Large-scale generation of stable crystal structures remains a challenging problem. Other than generating materials structures, Dan et.al. [12] proposed MatGAN to generate millions of novel materials formulas with chemical validity, which expands the candidates for inverse design of new solid materials.
In this work, we propose a novel deep generative model called CubicGAN to generate cubic materials structures on a large scale. Ternary materials selected from the OQMD [33] database are chosen as our training set because of its large size of materials and diverse compositions. In our model, material structures are represented by their lattice parameters, atom coordinates, element embedding, and the space group. The conditions of a specific space group and three elements are fed to the generator to generate desired crystal material structures. We trained ternary and a quarternary GAN models to generate novel cubic (ternary and quarternary) crystal structures of the space groups 216,255,221. Materials of these three space groups consist of 78.5% of all ternary and quarternary cubic materials in OQMD, covering a majority of known cubic materials space.
Our systematic experiments show that our CubicGAN model can recover not only many of the known cubic structures but also discover many new materials with new composition prototypes with different anonymous formulas (new prototypes). Additional large-scale DFT based validation has led to the discovery of 506 new cubic crystal materials of new prototypes. The detail of the CubicGAN model will be explained in the following sections. Compared to [15,16,25], our framework can generate a large variety of materials of different chemical systems. The only work that is similar to ours in terms of variety of materials is [14], in which Ren et al. use VAE rather than GAN as the generative model trained with train ternary materials in Materials Project [34] database. However, their model tends to generate new samples by interpolation. The second major difference is a much simpler representation is used in our work without the momentum space representations.
Our contributions can be summarized as follows: • We propose a novel GAN model to generate large-scale cubic materials conditioning on the elements and a specified spacegroup. In total, we generate 10 million hypothetical ternary and 10 million quaternary crystal structures for downstream analysis.
• We perform three stage checks on generated materials and extensively match the generated materials against existing databases. The results show that our method can rediscover a majority of cubic materials in the existing databases. In addition, most of the rediscovered materials from MP are confirmed as stable or meta-stable materials in terms of energy-above-hull.
• We perform DFT simulations on 108,897 hypothetical materials, of which 33.8% novel materials are successfully relaxed. By further analysis, we demonstrate that new crystal structure prototypes (with different anonymized formula types) can be found, such as ABC 6 -216, ABC 6 D 6 -216, and AB 8 C 12 -221.
• By further stability verification, 506 new-prototype materials have been generated and confirmed to be stable by phonon dispersion calculation.

Methods
In this work, we focus on training generators of ternary and quarternary cubic crystal structures of three space groups (216, 221, 225) to simplify our model design while ensuring coverage of a majority of the cubic design space. We find that in the OQMD dataset with 813,839 materials, 85.8% of them are ternary or quaternary materials. In addition, out of all the cubic crystals, 97.8% of them belong to these three space groups, again covering the majority of the known cubic materials space. These three space groups are selected because we find that for the materials of these three space groups, most of their nonequivalent atom fractional coordinates in the CIF files have a multiplicative factor of 0.25 or belong to this set [0, 0.25, 0.5, 0.75]. So, instead of generating cubic structures with arbitrary real-valued atom coordinates, we only aim to train a cubic material generator that only generates structures whose atom positions are sitting at positions with their fractional coordinate values to be from this set +/-0, 0.25, 0.5, 0.75. In this case, the special discrete fractional coordinates are much easier to generate accurately by our deep neural networks. This decision has dramatically simplified our generation model, and thus we choose the training data with these two criteria:ternary and quaternary cubic crystal structures of three space groups (216, 221, 225).

Dataset
We collect the training data from OQMD [35,33], which is an open-source database of experimental and DFT-calculated materials. Totally 813839 entries are retrieved from version 1.3 of OQMD. Entries calculated with local-density approximation (LDA) are also included. Among them, we successfully build 556,839 and 141,100 POSCAR files for ternary and quaternary materials in the OQMD, of which 505,456 and 127,659 structures belong to cubic crystal systems, respectively. After converting the POSCAR files to symmetrized CIF files, 411,646 ternary materials have three unique nonequivalent atom sites, of which 388,680 materials of cubic crystal systems are found; 129,514 quaternary materials have four unique nonequivalent atom sites, of which 127,523 materials belong to the cubic crystal systems. Table 1 shows the statistics of OQMD materials distributions. We can find that ternary materials of cubic crystal systems are the largest chuck (91%) out of all ternary materials. Similarly, it is observed that the ternary cubic structures with 3 nonequivalent sites are 94% out of all ternary materials with 3 nonequivalent sites. For quaternary materials, these two percentages are 90% and 98%, respectively. This means that our CubicGAN model can be used to generate hypothetical cubic materials that are the majority type of known material category. Another key criterion for selecting our training samples is that we only pick cubic structures with three nonequivalent atom positions (in CIF files) for training ternary GAN model (for quarternary GAN, the number is 4). Making this choice allows us to use a unified matrix of dimension (28 × 3) to represent all ternary cubic materials (for quarternary materials, the dimension is 27 × 3 where only one space group is used in this work). For a given material, once we have its nonequivalent positions and space group, the full atom positions within the unit cell can be converted to conventional atom positions by symmetry operations. We have identified 411,646 ternary materials with only three nonequivalent positions, of which 388,680 (94%) materials belong to cubic crystal systems as shown in Table 1. Out of these 388,680 materials, 22 space groups are found as shown in supplementary Figure 1,. Among them, the space groups that have the most numbers of materials are Fm3m and F43m (the total portion of these two space groups is 97.2%). Pm3m is the third one with only 6,462 samples or 1.7%. After removing the duplicate cubic materials within MP and ICSD, almost all the materials (375,749 out of 384,215) with the space groups of Fm3m, F43m and Pm3m follow this criterion. Table 2 shows the overall statistics of our finalized training and validation datasets. In total, we have selected 375,749 ternary materials from three cubic system space groups from OQMD to form the OQMD-TC3 (T:Ternary, C-Cubic, 3-three space groups) training dataset: Fm3m, F43m, and Pm3m each having 186,344, 184,162 and 5,243 materials respectively. These materials together correspond to 249,646 unique formulas. With this diversity of formulas, our CubicGAN model can efficiently learn valid combinations of ternary elements. The unique 84 elements in the datasets are utilized to generate random three-element combinations during GAN training. The same steps are applied to quaternary materials in OQMD. As shown in supplementary Figure 1, materials with space group F43m occupies 95% of the quaternary data. So for training the quaternary GAN model, we only choose materials of space group F43m.
We will use the ternary data from the Materials Project and ICSD databases as validation sets to check the rediscovery rates for our proposed method. We first process the ternary materials in Materials Project database [34] and ICSD [7] as we do for OQMD samples to create the MP-TC3 and ICSD-TC3 validation datasets. In total, 6,545 cubic materials are retrieved, of which the numbers of materials with Fm3m, F43m and Pm3m are 4576, 520, and 1449, respectively and there are 6,431 unique formulas existing in the whole retrieved data. From the ICSD database, 1,875 cubic materials are found to satisfy our seleciton criteria, of which the numbers of materials are 804, 280, and 791 forf space groups Fm3m, F43m and Pm3m.. For quaternary materials, the OQMD-QC1 training dataset has 121,008 samples. However, only 39 and 8 quaternary materials are found in MP and ICSD that satisfy our two selection criteria (SeeTable 2). We have removed these samples from our training dataset selected from OQMD by removing the crystal structures with a minor difference of cube lengths from the samples in the validation sets). In terms of prototypes in the validation datasets MP-TC3 and ICSD-TC3, supplementary Table 3 shows details of the existing prototypes for materials that satisfy our selection criteria. We take the prototype "ABC2-225" as an example. Here ABC2 and 225 are the crystal prototype anonymous formula and the space group number used to denote a prototype, and we will use this format in the following content. Overall, the three databases have the same set of prototypes; other than that, MP has an extra one: AB6C6-225. However, only one material (mp-1147668) is found under AB6C6-225 and is unstable. For quaternary materials in OQMD, there are only two prototypes, including ABCD-216, with 121,006 materials and ABCD 6 -216 with two materials. Moreover, we find that quaternary cubic materials distribution is highly biased with 121,018 belonging to space group 216, and only 5674 belonging to space group 225, and no samples found for space group 221. For simplicity, we train the quaternary CubicGAN using only the samples from space group 216 and it then can only generate samples of this space group.  Table 1. Taking a randomly selected space group (one-hot encoded), 3-element combinations (one-hot encoded), and random noise Z as inputs, the generator then generates material structures with the specified space group and element constituents. Space groups and elements are mapped into dense vectors by their corresponding embedding layers. The number of atoms for each element does not need to be specified as it can be determined by the space group symmetry operations. The random selections of space groups are based on the portions of three cubic space groups considered in our model: Fm3m, F43m and Pm3m. The detailed architecture is shown in Supplementary Figure 2.

CubicGAN Framework
An input to the discriminator has four parts: nonequivalent atomic coordinates, element properties, unit cell parameters, and space groups as shown in Figure 1. The coordinates part includes the fractional coordinates of three nonequivalent atoms. For three unique elements in each material, each element is represented by 23 properties as shown in the supplementary Table 1. Since the lattice lengths a, b, c are the same in cubic crystals, we only need to use one value to represent it. Three cubic space groups are one-hot encoded. As shown in supplementary Figure 2, four parts are concatenated together to form a tensor with the dimension of 3 × 28. The input is then forwarded to four 1D convolutional layers, of which the kernel size is 1 × 1, which is used to capture the implicit relationships among the four parts. We use two CNN layers to reduce the dimension from three to one. Then, a few fully connected layers are used to map them to Wasserstein estimation [31]. The detailed network settings are shown in supplementary Figure 2. In standard conditional GAN, the input of the generator includes the random noise and a condition vector [36]. Here, we add a space group embedding layer and an element embedding layer as shown in Figure1 to map the randomly selected one-hot encoded space group (chosen from 216/221/225) and three randomly selected elements (one-hot encoded) into the latent vectors. The reasons for this design are as follows: 1) As only three dominant cubic space groups are used in this work, the combination of atom positions with corresponding elements, unit cell lengths, and one-hot encoded space group symmetry is sufficient to describe a material structure; 2) Using element properties as part of the representation makes the generator learn to generate chemically valid materials, e.g., structures that do not violate Pauling's rules. As our previous work [12] shows, the composition constraints can be learned from the compositions of existing materials.
Here, our CubicGAN is also configured to learn both implicit compositional as well as structural constraints to help the generator generate only valid ternary or quaternary formulas as much as possible; 3) Our 2D representations of the cubic structures also matches well with the convolutional layers used in the discriminator, in which the convolutional operations can extract implicit relationships among four parts of information.
The generator and the discriminator of the CubicGAN model are trained with the loss function of Wasserstein distance [31] which measures the dissimilarity between distribution differences of real and fake materials. Compared to loss functions used in traditional GAN [37], Wasserstein distance improves the model stability and prevents the mode collapse. We use the gradient penalty to clip weights in order to improve the stability of training as done by Gulrajani et al. [38]. The penalty of gradient norm with respect to the inputs works as a regularization term to stabilize the training process of the GAN. More formally, our cost function for GAN training is as follows: where D is the discriminator, Px is the distribution of interpolated samples between the distribution of real materials P r , and the distribution of generated materials P g . λ is the balancing parameter, which is set to 10 in this work.
After inspecting the generated structures by the GAN, we find that the generated lattice parameter a is often not good enough, leading to overlapping atom clusters. To address this issue, an additional post-processing step introduced to predict the lattice length a using a composition based machine learning model that we recently developed [39], which achieves a R 2 score of 0.979 for cubic lattice a prediction.
During training the GAN, real materials are randomly picked in batches. With the fused matrix of generated materials as shown in Figure 1, they are fed to the discriminator in a mixed manner. We set the number of iterations of the discriminator per generator iteration as 5. The GAN model is developed using the open-source libraries of TensorFlow [40] and Keras [41]. More details regarding model architecture and hyper-parameter setting can be found in Supplementary   There are three major criteria for evaluating generative models, namely, validity, diversity, and uniqueness [42]. After training the ternary CubicGAN using the OQMD-TC3 dataset, we generate 10 million cubic structures of the specified three cubic space groups (225,216,221). The proportions of the samplings are set as identical to the training set, which is 49.6%-49.0%-1.4% respectively. To evaluate the generation performance, we first check how the percentage of the generated charge-neutral samples changes with respect to the total number of generated samples. The charge neutrality check is based on Pymatgen [43] using the common valence values of elements as defined in Pymatgen. As shown in Figure2a, the charge-neutral samples' percentage maintains around 41% over the whole process of generating 10 million samples, which means that when we generate 10 million samples, appropriately we can get 4.1 million charge-neutral samples for downstream screening. We then checked how the percentage of the generated samples have pymatgen-readable CIFs (Crystallographic Information File), unique CIFs, and unique formulas, which reflect the diversity and uniqueness of the generator. In Figure 2a, the blue line demonstrates the percentage of cifs readable by pymatgen in terms of sampling size, and the sampling size is from ten thousand to ten million. In this work, pymatgen-readable means that CIFs can be recognized as the space group that is assigned to. We can find that the percentage of readable CIF files is stable no matter how we run the sampling. After removing the duplicate materials, we calculate the percentage of unique CIFs and unique formulas as denoted by the yellow and red lines in Figure 2a.
Only those materials that have the same formula and the same corresponding atom positions are considered as duplicates here. It is found that the percentages of the unique CIFs and unique formulas are decreasing and growing flat. From these observations, we believe that our GAN model might have explored the majority of the cubic crystal structure space but have not exhausted it yet.
Another effective way to evaluate the CubicGAN's performance is to check how soon it can rediscover the known cubic crystals in leave-out datasets of existing databases. To do this, in our training dataset, we have removed all the materials of the three cubic space groups (216,225,221) existing in MP and ICSD databases, which are 6,545 and 1,875, respectively. It is interesting to see how many of those leave-out cubic materials can be rediscovered by our GAN model as the sampling size goes from ten thousand to ten million. Figure 2b shows how the rediscovery percentages of the cubic crystals of the three space groups (216,225,221) change as the sampling size increases. Figure 2b shows the rediscovery rates over time of sampling. At first, we check how the percentage of the rediscovered cubic samples out of all training samples (blue line) changes while generating more samples. It is found that this training set rediscovery rate increases consistently over the sampling process. It soars quickly to 88% when the sampling size increases until 5 million samplings are reached. At the end of 10 million samplings, the rediscovery rate reaches 95.5%. Similar patterns can be observed for the rediscovery rate curve for the MP-TC3 validation dataset, as shown by the green line. With the increasing number of samplings, the rediscovery rate reaches 72.0%. This saturated percentage is much lower than that of the training set, which is due to MP-TC3 data has different proportions over the three space groups (225,216,221), which are 69.9%-7.9%-22.1% respectively compared to 49.6%-49.0%-1.4% of the training set. Since our generation process is based on the space group proportions of the training set (which focuses on generating candidates of space groups 225 and 216, the 72% rediscovery rate is close to the percentage of these two types of samples in MP-TC3 (69.9%+7.9%=77.8%). We also find that half of the rediscovered materials in MP-TC3 are stable based on the formation energy and e-above-hull criteria. Details of the stability analysis can be found in Supplementary  Figure 3. The rediscovery rate pattern over ICSD-TC3 is similar to that of MP-TC3 except that the highest rediscovery rate is 50.7% at the sampling size of ten million, which is close to the percentage of total samples of space groups 225 and 216 (42.9%+14.9%=56.8%). These high rediscovery rates over the training set and the two validation sets demonstrate that our CubicGAN has learned the implicit chemical rules of the cubic structures to generate in a much better way than random sampling. After the sampling size reaches 7 million, the number of materials rediscovered converges, indicating that ten million samplings could be a reasonable size to cover most of the cubic structures since they seem to have almost exhaustively explored the search space of materials that meet our criteria. Therefore, we use ten million samplings for further analysis.
To compare how our CubicGAN performs compared to random sampling or exhaustive enumeration, we calculate the enrichment score for our ternary CubicGAN. As we are searching candidates of three cubic space groups with three unique sites of three distinct elements and the only possible fraction coordinates are 0,0.25,0.5,0.75, the total possibility of configurations are (4 3 ) 3 * 85 * 84 * 83 * 3 = 466, 055, 331, 840, which is much larger than the corresponding combinations of the ternary composition space [12]. Considering that with 10 million samplings, we have rediscovered 95.5% of the OQMD-TC3 dataset, the enrichment score is approximately 44,507, which is a significant boosting for generating chemically valid crystal structures compared to exhaustive enumeration. While rediscovery rate analysis over the MP-TC3 and ICSD-TC3 validation sets have demonstrated the accelerated sampling in cubic structure space, there are only 6,545+1,875=8,420 validation samples plus the 358,840 rediscovered training samples. It is still desirable to check the chemical validity of the remaining 96.33% generated samples and filter out those promising new materials. With 10 million hypothetical cubic materials, it is impractical to perform DFT calculations for all of them to verify their chemical validity and stability. Here we adopt three stages of validation check to reduce the pool of samples for DFT validation. We use the CGCNN based graph neural network model for formation energy prediction, which was trained with samples from Materials Project database [44]. Then we scan the generated materials in the order of space group match, charge neutrality, and formation energy filtering. The nonequivalent coordinates are transformed by symmetry operations provided by relevant space groups used when generating samples. With the full coordinates set, elements, unit cell parameters ( unit cell length a and angles, which are always 90 degrees in cubic systems), we could write a Crystallographic Information File. The space group check is performed by Pymatgen [43] in the first place (we refer to this check as a Pymatgen-recognizable check). If the generated sample cannot be recognized by Pymatgen or the space group analyzed by Pymatgen is not consistent with the space group given to the generated sample, this sample is considered as a failed generated case. As shown in Table 3, in total there are 2,558,678 and 5,498,267 valid ternary and quaternary CIFs have been found from 10 million generations, respectively. From them, candidate materials with charge neutrality and CGCNN-predicted negative formation energy are reserved for further DFT calculations based verification.

Large scale generation of new cubic crystal materials structures
A major evaluation of our CubicGAN model is to check whether it can generate new cubic materials with novel prototypes, which are represented by distinct anonymized formulas in Pymatgen. As shown in Table 3, we find that 24 and 1 novel prototypes for ternary and quaternary materials, respectively, have been found in our generated samples that are not existent in our training data. For relieving the burden of DFT calculations, we choose to remove the samples that contain Lanthanoid and Actinoid elements. In total, 1,064,650 ternary materials are left, of which 209,744 materials are of new crystal prototypes. The distribution of prototypes for 1,064,650 materials is shown in Supplementary Figure 4.
Similarly, 4,382,130 quaternary materials are left after removing Lanthanoid and Actinoid elements, of which 260,891 materials are of the new crystal prototype (the prototype ABC 6 D 6 -216). Since only two ABCD 6 -216 materials exist in the quaternary training dataset OQMD-QC1, we also include ABCD 6 -216 materials for the downstream DFT analysis considering the huge number of generated ABCD 6 -216 samples (1,655,407

Discovery of 506 new-prototype stable materials verified with DFT calculations
After filtering down materials with novel prototypes, we perform DFT optimization on materials with CGCNN-predicted negative formation energy, and we use Γ points and mechanic constants to further scale down the successfully relaxed structures. Phonon dispersion is the eventual criterion to determine the stability of structures.
Gamma points and mechanic constants filtering The vibrational frequencies at the Γ point together with the elastic constants of screened structures were obtained by calculating the Hessian matrix (matrix of the second derivatives of the energy with respect to the atomic positions) [45], which can be done by setting IBRION=6 (NFREE= 4) in VASP run. For cubic structures, the mechanical stability of lattice structures is verified as C 11 > 0, C 44 > 0, C 11 > |C 12 |, C 11 + 2C 12 > 0, where C ij are components of elastic constant matrix [46]. After screening the materials with the mechanical criteria, we further narrow-down the materials by checking the vibrational frequencies at the Γ point. All materials with negative Γ point frequencies were discarded.
Phonon Dispersion calculation After the structures pass the mechanical stability criteria and all Γ point frequencies are positive, we further calculate the full phonon dispersions in the first Brinounion zone (BZ). All 2 nd interatomic force constants (IFCs) of the cubic structures were computed in a 2x2x2 supercell based on their corresponding primitive cell. Then, the phonon dispersions were calculated by using the PHONOPY package [47] with high symmetry paths In total, four prototypes with stable materials are discovered: ABC 6 -216, AB 6 C 6 -225, ABCD 6 -216, and ABC 6 D 6 -216. The details of the number of materials for each prototype are shown in Supplementary Figure 5. To our best of knowledge, ABC 6 -216 and ABC 6 D 6 -216 are novel prototypes that are not in our training dataset, and the validation sets MP-TC3 and ICSD-TC3. Also, the AB 6 C 6 -225 prototype is not in the training dataset and only one unstable material can be found in MP. However, our method finds 42 stable ones. Two materials of ABCD 6 -216 prototype are in the training dataset, and several others are in MP and ICSD. We expand the datasets by finding 62 stable materials of prototype ABCD 6 -216. Overall, we find 183 stable ternary materials and 323 stable quaternary materials. Figure 3 shows four newly discovered stable cubic materials with their phonon dispersion curves. The CIF files of the 506 new prototype materials can be found in the supplementary file.
Some interesting features have been observed from the phonon dispersions of newly discovered materials. For instance, a couple of hundred cubic structures we have screened out possess significant but tunable phonon bandgaps (e.g., CaCO 6 as shown in Figure 3(a)). Such phonon bandgaps could lead to extraordinary hot carrier performance [49,50,51,52], which is very promising for their potential application in photovoltaics, nonlinear optics (e.g, ultrashort pulsed lasers), multi-exciton generation devices, and even photocatalysis. Large phonon bandgaps at extremely high frequencies (such as H-containing materials not shown herein) deserve further investigation for their electron-phonon coupling properties [53,54,55], which could be beneficial for designing novel superconductors. Also, there are many cubic materials possessing very soft acoustic modes, e.g., the longitudinal acoustic (LA) phonon branch in KYNbSi 6 ( Figure 3(c)), which indicate strong phonon anharmonicity and could be good candidates for waste-heat energy recovery (thermoelectrics). Last but not least, the phonon dispersion of Y 6 AlTe 6 structure exhibits a very large gradient in high-frequency optical phonon modes and thus their phonon group velocities will be very high, which could lead to a significant contribution to the overall thermal transport from these optical modes and thus unusual temperature-dependent lattice thermal conductivity [56]. To qualitatively evaluate how the new-prototype materials are structurally different from existing materials, we represent both sets of materials using simulated (for generated samples) and real X-ray diffraction (XRD) spectrum of dimension 901, which is a way to analyze the structures of inorganic materials. The simulated XRDs are calculated using the Pymatgen package [43]. We then use the t-SNE embedding approach [57] to map the samples' XRD vectors into 2D space, which are then plotted together to visualize how existing and novel materials of the same space group 216. Figure 4 shows t-SNE embedding of existing materials in the training dataset and newly discovered ABC 6 -216 materials. From Figure 4(a), we can find that new prototype materials (dark green dots) form distinct clusters, and there are apparent boundaries between known and unknown materials, which indicates that our model can generate materials beyond the scope of existing prototypes with significant structure deviations. Figure 4(b) shows a zoomed region of clusters of novel ABC 6 -216 materials, which implies that even samples of the same prototype can form structurally different clusters. For the other three prototypes, the distribution of known structures and our new-prototype structures are shown in Supplementary Figure 6-Figure 9. In Supplementary Figure 6, we find that the new-prototype ( AB 6 C 6 ) materials are mostly located at the peripheral regions of know materials clusters, indicating their structural closeness to known structures. In contrast, supplementary Figure 7 shows that materials of two new-prototypes (ABC 6 and AB 6 C 6 ) tend to form distinct clusters from known cubic materials in MP-TC3 and ICSD-TC3 validation sets, indicating their structural deviation from known materials. Additionally, for most of these new-prototype clusters, we have identified one or more DFT-verified stable materials. Supplementary Figure 8 shows that materials of new-prototype ABCD 6 form multiple new clusters, each of which contains multiple DFT-verified stable materials. Instead, materials of new-prototype ABC 6 D 6 form much fewer cluster compared to the training set OQMD-TC3.

Conclusion
Large scale generation of new materials with distinct structures and functions are highly desirable for widely used highthroughput screening based materials discovery. Faced with astronomically large structural design space (compared to the space of the chemical compositions), the generator models have to exploit the implicit sophisticated physicochemical and geometric rules and constraints embedded in the existing crystal materials. Here we propose a novel GAN-based deep generative model for large-scale generation of three major types (space groups:216, 225, 221) of cubic materials structures. Trained with 375,749 ternary cubic crystal structures from OQMD, our CubicGAN model can rediscover most of the known cubic structures as curated over more than 100 years of history within 10 million samplings.
Especially, further analysis shows that our GAN model can generate not only new materials of existing prototypes but also new-prototype materials with distinct structural novelty. In total, we have identified 24 new prototypes of cubic materials. With rigorous DFT-based relaxation and phonon dispersion calculation, we have identified and verified 506 new-prototype cubic materials, which are shared via our pubic database. From them, we have already identified several crystal structures with exceptional properties to be exploited in future. Together, our CubicGAN has demonstrated a promising path to large-scale generation and discovery of new materials.

Data Availability
All the training data are downloaded from OQMD and Materials Project website.

Acknowledgement
Research reported in this work was supported in part by NSF under grant 1940099. This material is based upon work supported by the National Science Foundation Harnessing the Data Revolution Big Idea under Grant No. 1905775. The views, perspective, and content do not necessarily represent the official views of the NSF. We thank Yuxin Li for his help on lattice parameter prediction.