Genome Mining in Glass Chemistry Using Linear Component Analysis of Ion Conductivity Data

Abstract Understanding the multivariate origin of physical properties is particularly complex for polyionic glasses. As a concept, the term genome has been used to describe the entirety of structure‐property relations in solid materials, based on functional genes acting as descriptors for a particular property, for example, for input in regression analysis or other machine‐learning tools. Here, the genes of ionic conductivity in polyionic sodium‐conducting glasses are presented as fictive chemical entities with a characteristic stoichiometry, derived from strong linear component analysis (SLCA) of a uniquely consistent dataset. SLCA is based on a twofold optimization problem that maximizes the quality of linear regression between a property (here: ionic conductivity) and champion candidates from all possible combinations of elements. Family trees and matrix rotation analysis are subsequently used to filter for essential elemental combinations, and from their characteristic mean composition, the essential genes. These genes reveal the intrinsic relationships within the multivariate input data. While they do not require a structural representation in real space, how possible structural interpretations agree with intuitive understanding of structural entities known from spectroscopic experiments is finally demonstrated.


Introduction
DOI: 10.1002/advs.202301435
Relations between chemical composition, the spatial arrangement of atoms (structure), and the evolving macroscopic properties are of central interest in glass science and technology. [1][2][3][4][5][6][7] The concentration of a single chemical element (or nominal compound) is the intuitive descriptor for starting to explore these relations. However, this simplistic approach is usually insufficient to account for material behavior in glasses containing multiple chemical elements, in particular for properties as complex as ionic conductivity. [3][8][9][10] In this case, a plot of a material property over the concentration of a single element may show broadly scattered data, a vertical line, or some kind of mathematical trend with many or only a few outliers, all of which can be highly misleading. Such observations might be caused by the insufficient expressivity of single-component concentration values as descriptors: multicomponent glasses usually exhibit intricate, typically unknown non-linear relationships and interdependencies involving multiple elemental species. From this, the immediate question arises as to how multicomponent descriptors can be identified for a given glass property, i.e., the glass genome. [11] Are all or only some of the elements forming the material involved in each such descriptor? What is the molar ratio of each element in these descriptor constructions? Can the descriptors be traced back, directly or indirectly, to structural entities or building blocks known to exist in the material? In the following, we demonstrate how these questions can be answered by analyzing composition-property correlations for strong linear components, introducing Strong Linear Component Analysis (SLCA) for deciphering glass property genomes on small datasets.
In this context, the term gene is used for functional groups of elements acting as descriptors for a particular property. Genes are presented as fictive chemical entities with a characteristic stoichiometry derived from SLCA. They do not need to be charge-balanced, and their constituents do not even need to be chemically connected; thus, they do not require a structural representation in real space (although such knowledge could be helpful in their further interpretation). As an example, three functional groups which could be responsible for ion conductivity are depicted schematically in Figure 1. All of these groups involve the gene ABC2 as a common feature, whereas further constituents (depicted as green balls) could be any of the three elements A, B, and C, or even additional, unknown components. In this schematic example, the gene ABC2 would have been extracted by SLCA as a fundamental descriptor for ionic conductivity in the specific type of glass.
Figure 1. Schematic of the gene ABC2 as the common feature in three different functional groups responsible for a particular glass property. Green spheres represent unnamed elemental entities, e.g., A, B, C, or D.
Technically, the present method is based on an optimization problem that maximizes the R² from linear regression [12] between a property (here: ionic conductivity) and the sum of weighted concentrations of each constituent by tuning the weighting factors. This optimization is done for all subsets of composition-property data (that is, for all possible combinations of elements; each combination is handled as an individual model). Assisted by matrix rotation, principal component analysis (PCA) [13], and family trees, those models that are essential for the studied property are identified, and from their characteristic mean composition, the essential genes.
Thereby, the absolute concentration of a gene is initially undetermined; unlike common regression analysis or similar machine-learning approaches that are frequently applied in glass science, [14] the present study does not focus on relationships between the weighted sum of certain elemental components and a target property. The genes, however, may act as descriptors in physics-informed ML models for glass property predictions.
We base our analysis on a uniquely consistent set of 56 glasses from the Na2O-P2O5-AlF3-SO3 family, fabricated according to a uniform protocol, with exact compositional data and property measurements available from experimental study.

PCA and Dataset Dimensionality
The significance values obtained by applying PCA to the chemical composition data are listed in Table 1. Summing up the significance values of the principal components PC1-PC3, we find that three PCs are sufficient to explain 99.80% of the variation within the dataset. When only PC1 and PC2 are used, this value reduces to 90.62%. Thus, using more than three variables (genes) will probably overfit the current dataset, while using only two variables leads to underfitting. This reduction in dataset dimensionality (from 6 to 3) can be understood intuitively from the correlation between anion and cation fractions (charge compensation), mass conservation, and possible systematic correlations produced by interrelated batching (e.g., when using Na2SO4 or AlF3 for batching S or F).
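The dimensionality argument above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the 0.99 threshold, and the synthetic 56 × 6 matrix (built from three hidden factors) are assumptions chosen only to mimic the situation described in the text.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    X = np.asarray(X, dtype=float)
    # Standardize each column: zero mean, unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Covariance matrix of the standardized data.
    C = np.cov(Z, rowvar=False)
    # eigvalsh returns ascending eigenvalues; reverse to descending order.
    eigvals = np.linalg.eigvalsh(C)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against round-off
    return eigvals / eigvals.sum()

def n_components_for(ratios, threshold):
    """Smallest number of leading components reaching the threshold."""
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Synthetic stand-in (assumption): 56 samples, 6 columns, 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(56, 3))
mixing = rng.normal(size=(3, 6))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(56, 6))

ratios = explained_variance_ratio(X_demo)
k = n_components_for(ratios, 0.99)
print(np.round(ratios, 4), k)
```

For real data, the ratios would be compared against the significance values reported in Table 1.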

R 2 Optimization
The optimized R² value describes the best linear relationship found in each model by optimizing the weighting vector w. The set of obtained values is provided in Table S1, Supporting Information. Results obtained for conductivity at different temperatures show very similar trends; therefore, we initially limit the discussion to conductivity at 50°C. The overall results in Table S1, Supporting Information show that the R² value increases from group I (d = 1) to group VI (d = 6); for d = 6 (including all six elements in the SLCA procedure), a maximum of R² = 0.919 is obtained. With increasing temperature, there is a slight decrease in the optimized R² values. In order to indicate the descriptive power of each model, the individual R² values in Table S1, Supporting Information are stated relative to the maxima found for d = 6; in the following, we refer to these relative values.
The group I models (including only one single elemental species) have R² values ranging from 23.2% to 72.2%; only for the correlation of conductivity with P is a value of 92% obtained (we assume that this is an artifact originating from the narrow range of phosphate concentrations in the dataset, and further from interrelated batching). A significant improvement is observed in the group II models, for which the R² values increase up to 99.2%. This reflects the PCA, where an increase in confidence from ≈70% to ≈90% was seen when moving from one single PC to two PCs. In group III, the linearity improves slightly further, with most models achieving R² above 99.2%. Groups IV, V, and VI show no further significant improvement in the relative R² value, with almost all models reaching 100%. Although we are not optimizing for maximum variance as in PCA, the group-to-group comparison strongly resembles the results of PCA. This is because the total sum of squares in the calculation of the R² value is proportional to the variance of the data. Finally, the decrease of R² with increasing temperature indicates that the conductivity depends less on composition at higher temperatures.
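The link between R² and variance mentioned above can be checked numerically. The sketch below uses synthetic data (an assumption, not the glass dataset) to show that the total sum of squares equals (m − 1) times the sample variance, and that the R² of a one-variable linear fit equals the squared Pearson correlation.

```python
import numpy as np

def r_squared_linear(x, y):
    """R² of the least-squares line y ≈ a*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a, b = np.polyfit(x, y, 1)          # slope, intercept
    f = a * x + b                       # predicted values
    ss_res = np.sum((y - f) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Synthetic demonstration data (assumption).
rng = np.random.default_rng(42)
x_demo = rng.uniform(0.0, 1.0, size=56)
y_demo = 2.0 * x_demo + 0.3 * rng.normal(size=56)

r2 = r_squared_linear(x_demo, y_demo)
r_pearson = np.corrcoef(x_demo, y_demo)[0, 1]
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
var_y = np.var(y_demo, ddof=1)
print(r2, r_pearson ** 2, ss_tot, (len(y_demo) - 1) * var_y)
```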

Weighting Vectors and Mean Compositions
Table 2. Essential genes with their mean stoichiometry and weight vectors w, general compositions, and R² values. Signs on the weight factors are chosen so that positive factors signify increasing conductivity. The components of the weight vector are listed in the same order as in the model X̃ representation. Data are provided for conductivity at 50, 150, and 250°C. Columns: essential gene | gene (50°C) | general composition (50°C) | R² (50°C) | gene (150°C) | R² (150°C) | gene (250°C) | R² (250°C). First row label: Gene 1 [Na/O/S].

The weighting vectors w = {w1, …, wd} (corresponding to the optimized R² in each model) are used to calculate the chemical formulas of conductivity genes. The optimized weighting vectors with R² > 99.0% (group III; d = 3) are listed in Table 2. Mean compositions c are then calculated using Equation (8), multiplying w by the mean concentration X̄ of the individual elements in Table S2, Supporting Information. The components in c are normalized to the quantity of one reference element, chosen in the order of priority O > Na > P > Al > F > S (ascending with the ratio Std./Mean in Table S2). The most stable element is always used as the reference in normalization. The elements Na, O, and P are relatively more stable than S, F, and Al. The signs of the components in c are forced positive. In this way, c represents the stoichiometry of genes underlying the present dataset. The practical consequence of this procedure is illustrated by way of example in […] of the gene is undetermined. The standard deviation represents the uncertainty of gene stoichiometry, which we express in the general composition of each gene (see Table 2).
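As a minimal sketch of the conversion from weights to a gene formula (Equation (8) followed by the normalization described above), consider the following. The function name, the weight values, and the mean concentrations are hypothetical placeholders, not values from Table 2 or Table S2; only the priority order is taken from the text.

```python
import numpy as np

# Priority order for the reference element, as stated in the text.
PRIORITY = ["O", "Na", "P", "Al", "F", "S"]

def gene_stoichiometry(elements, w, mean_conc):
    """Turn a weight vector into a normalized, sign-free stoichiometry."""
    w = np.asarray(w, dtype=float)
    xbar = np.array([mean_conc[e] for e in elements], dtype=float)
    c = np.abs(w * xbar)  # c_i = w_i * mean concentration, signs forced positive
    # Reference element: the highest-priority (most stable) element present.
    ref = min(elements, key=PRIORITY.index)
    c = c / c[elements.index(ref)]
    return ref, dict(zip(elements, c))

# Hypothetical inputs (assumptions for illustration only).
elements = ["Na", "S", "O"]
weights = [0.5, 0.3, -0.2]
mean_conc = {"Na": 20.0, "S": 2.0, "O": 55.0}

ref, gene = gene_stoichiometry(elements, weights, mean_conc)
print(ref, {k: round(float(v), 3) for k, v in gene.items()})
```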

Determination of Essential Genes
As a crucial step in the SLCA, the set of models needs to be filtered for essential genes, which represent neither underfitted nor overfitted models. From the PCA, we know that the hyperparameter n_d should not exceed 3; otherwise, the model is overfitted. However, this information alone is not sufficient to identify the essential models. The R² values in group III still vary from 54.0% to 99.8%, meaning that some models are underfitted while others are overfitted. Removing the models with low R² to avoid underfitting within group III is straightforward. Excluding overfitted models is less obvious: R² alone cannot distinguish between overfitted and just well-fitted models. Therefore, an additional criterion is applied to the models of group III, whereby a model is considered overfitted when its R² is high but not rooted in the group hierarchy: adding one more element from group to group must improve model linearity. Figure 3 shows two families originating from S and P, selected by way of example for their highest R² among the group I models.
On the second level of the family trees, only those models (group II) are considered descendants of S (or P) which (i) include S (or P) and (ii) have a higher R² than their parent. On the third and any higher level, in addition to (i) and (ii), the molar ratio and qualitative contribution (signified by a positive or negative sign) of the parent elements must also be reflected in the child's model stoichiometry (iii). For example, all five possible group II descendants of S have increased R² values; therefore, they are all children of S. Each of these five children has four further descendants in group III; however, the majority of those second-level descendants are excluded because of the stoichiometry criterion (iii), or because of an insignificant increase in R² over the group II ancestor (indicating overfitting). As a result, we identify four essential genes and a limited number of uncertain models (for which exclusion based on deviating stoichiometry is less clear; the main reason for such uncertainty is the limited size of the dataset relative to the strong variation in the elemental concentration of S). The same procedure is carried out for element P and its group II and group III descendants. Here, the models [P/F] and [P/Al] already achieve a relative R² of 99.2% in group II; for symmetry reasons, they both converge in the group III model [P/F/Al] with a slight further increase in R². Overall, two additional, unique genes are identified in this way, whereas the [Na/P/S] model recurs in both the S and the P families. The six essential genes are summarized in Table 2.
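Criteria (i) and (ii) of the family-tree construction can be sketched as a simple filter. The code below is an illustrative assumption, not the authors' implementation: the function name, the `min_gain` parameter, and all R² values are hypothetical, and the stoichiometry criterion (iii) is deliberately left out because it requires the weight vectors of parent and child.

```python
def children_of(parent, models, min_gain=0.0):
    """Return models one level up that contain the parent and improve on its R²."""
    parent = frozenset(parent)
    kids = []
    for combo, r2 in models.items():
        one_more_element = len(combo) == len(parent) + 1
        contains_parent = parent < combo                    # criterion (i)
        improves = (r2 - models[parent]) > min_gain         # criterion (ii)
        if one_more_element and contains_parent and improves:
            kids.append(combo)
    # Strongest descendants first.
    return sorted(kids, key=lambda c: -models[c])

# Hypothetical relative R² values (assumptions for illustration only).
models = {
    frozenset({"S"}): 0.72,
    frozenset({"S", "Na"}): 0.95,
    frozenset({"S", "P"}): 0.70,
    frozenset({"Na", "P"}): 0.99,
    frozenset({"S", "Na", "O"}): 0.998,
}

level2 = children_of({"S"}, models)
level3 = children_of({"S", "Na"}, models, min_gain=0.01)
print(level2, level3)
```

Setting `min_gain` above zero is one possible way to reject descendants whose improvement in R² is insignificant, which the text associates with overfitting.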
When conductivity is correlated to glass composition using the gene weighting factors, highly refined linear correlations are obtained, see Figure 2b; the composition of the essential genes represents the internal correlation among the chemical components in determining ionic conductivity. In Figure 2, the essential models [Na/S/O] and [P/F/Al] are used for this demonstration. They do not contain any common element and, consequently, seem to conjugate in their contribution to ionic conductivity.
Five of the six essential genes involve Na; as expected, it is the most important element in increasing the ionic conductivity. Beyond Na, S is the second most frequent species; therefore, we focus our discussion of the structural meaning of the identified genes on these two elements. In addition, we find that the Na contribution increases consistently with temperature in each of the genes in which it is present. We therefore close the following discussion with a perspective on the temperature dependence of Na mobility.

Structural Significance of Essential Genes
Combining gene composition and bonding information can potentially reveal the structural significance of the identified genes. In order to explore this aspect, we first interpret each element as a structural entity: [6][7][8][16] S as a sulfate tetrahedron, P as a phosphate tetrahedron, Al as an alumina octahedron, F as a terminal P-F or Al-F bond or as an Al-F-Al link, and Na as Na+ ions attached to terminal F bonds, non-bridging oxygen species, or isolated sulfate tetrahedra. Assumed charge carriers are Na+ [17,18] and, possibly, F− and O2−. We now consider each gene separately, using the general stoichiometry (taking into account the standard deviation found per gene for the contributions of S, F, and Al; the gene itself is specific, but uncertainty arises from the underlying dataset). In Figure 4, possible structures are depicted for each gene at the extremes of its respective general stoichiometry.
At the upper end of its general formula, the [Na/S/O] gene involves two Na+ ions per SO4(2−) tetrahedron, that is, a sodium sulfate group (Figure 4a, left). Lower S signifies an excess of Na+, charge-compensated by non-bridging oxygen species (i.e., up to 3 Na and 6 O when S is at its minimum value of 0.09). A possible structural representation of this is depicted in Figure 4a (right), where a sulfate tetrahedron is surrounded by the additional oxygen species in a mix of bridging and non-bridging configurations between network-forming polyhedra, charge-compensating the additional Na+ ions. In effect, although the exact chemical environment is uncertain, the structural role of the [Na/S/O] gene is that of an SO4 bridge in an Na-rich environment. This agrees with the observation of the conjugate [P/F/Al] gene discussed in the previous section. The bulk conductivity then depends on the frequency of these ionic bridging entities, which presumably interact to form ion transport channels.
The normalized extreme cases of the [Na/P/S] gene are Na3P3S and Na10P10S. Both formulas maintain a fixed Na:P ratio of 1. This stoichiometry reflects the interaction between a network former (P) and a network modifier (Na), bridged by sulfate entities. It involves phosphate groups with 3 to 10 tetrahedra per sulfate ion. Compared to the [Na/S/O] gene, [Na/P/S] is more specific in that it represents similar sulfate bridges, but now within a defined range of superstructural network arrangements (Figure 4b).
NaS0.13−0.35F0.06−0.17 and NaS0.11−0.30Al0.08−0.16 (3)
In [Na/S/F] and [Na/S/Al], the phosphate group is replaced by F and Al, respectively. Again, both genes represent SO4 bridging units, however, at the -P-F-Al- phosphate junctions. The S:F ratio in [Na/S/F] is fixed at 2:1, indicating that a sulfate bridge involves two sulfate tetrahedra and one terminal F (Figure 4c, left). The [Na/S/Al] gene has an S:Al ratio varying from 2:1 to 1:1, meaning that in this case, the sulfate bridge consists of one or two sulfate tetrahedra (Figure 4c, right).

Temperature Effects
The SLCA was carried out for conductivity data collected at 50, 150, and 250°C (Table 2). The [Na/P/O] and [P/F/Al] genes are practically insensitive to temperature within this range, indicating structural stability below Tg. For all other genes and, in particular, [Na/O/S], the relative contribution of Na increases notably with increasing temperature. At the same time, the weight of P in [Na/P/S] decreases. These observations indicate that temperature primarily affects the strength with which the sulfate bridge localizes Na ion species. With increasing temperature, the sulfate entity interacts with fewer and fewer phosphate groups and, in turn, assembles more Na in its vicinity. Overall, it is not surprising that the gene stoichiometry is temperature dependent, given the multitude of interaction potentials within multicomponent glasses such as the present ones.

Conclusion
In the context of the material genome, genes are functional groups of elements acting as descriptors for a particular property. Studying ionic conductivity in glasses from the Na2O-P2O5-AlF3-SO3 family, we presented them as six fictive chemical entities with a characteristic stoichiometry derived from strong linear component analysis. SLCA maximizes the quality of linear regression between a property (here: ionic conductivity) and champion compositions from all possible combinations of elements. Family trees and matrix rotation allow for the identification of essential genes, which are filtered from the set of all possible combinations of elements present in the considered dataset.
Figure 5. Schematic of the gene extraction method using SLCA. SLCA starts with a linear regression and optimization for maximum R² of composition-property correlations for all possible combinations of elements, based on the weighted sum of elemental concentrations (leading to 63 individual models). Optimized weighting factors and R² values are then used to construct family trees, and models are filtered by PCA and physical ancestry. From this, essential models are obtained, whose mean compositions are referred to as essential genes.
While such genes do not require a structural representation in real space, we finally demonstrated how possible structural interpretations agree with intuitive understanding of structural entities known from spectroscopic experiments.

Experimental Section
General: A general schematic of the approach used to extract genes for glass conductivity is shown in Figure 5. We initially constructed 63 linear regression models according to all possible subsets X̃ of independent variables (elemental constituents) in the input matrix X (Weight Vector Optimization). The choice of linear models reflects Occam's razor, but it is not unique: other model orders could have been chosen for higher complexity. A weighted sum of the independent variables in X̃ is generated in each linear regression model, yielding a total R² feedback value as a function of the individual weighting factors used in the model for each variable. By tuning the weighting factors, R² is maximized for a particular model; the optimized weighting factors and the optimized R² are the output values of this procedure. All optimized models are subsequently categorized into groups according to the number of independent variables in X̃. PCA is used to filter for groups that are neither underfitted nor overfitted (Principal Component Analysis). Family trees are then constructed, in which models are probed for ancestry by number of components, and unphysical models are removed. The remaining models are referred to as essential models, and their mean compositions are the essential genes.
Dataset: As proof of principle, we focus on gene extraction for the (sodium) ion mobility in glasses of the Na2O-P2O5-AlF3-SO3 family. For these glasses, a highly consistent dataset including 56 individual samples with variable chemical composition is available from our lab, together with a range of physical and spectroscopic data. [6][7][8] In order to improve the accuracy of chemical composition data over previously available energy-dispersive X-ray spectroscopic results (EDX), [15] all glasses were reanalyzed by fully quantitative wavelength-dispersive X-ray spectroscopy (WDX) using a JEOL JXA8800L microprobe analyzer. Generally, compared to standardless EDX, the WDX technique provides improved energy resolution, a better detection limit, and more precise results, in particular for lighter elements such as F. The elemental composition obtained in this way and the previously reported ionic conductivity are the (X, Y) input in Figure 5.
All samples were coated with a layer of approximately 5 nm of carbon before WDX measurement to avoid charging effects under the electron beam. In order to minimize beam damage of the sensitive Na2O-P2O5-AlF3-SO3 glasses, the count rate of the elements in question was initially recorded as a function of beam time, and the excitation conditions were optimized accordingly. As a result, the energy of the exciting electrons was set to E0 = 10 kV, and a moderate beam current of 30 nA was applied. The beam diameter was defocused to 100 μm. As diffracting elements, a TAP crystal (2d: 25.757 Å) was used for the detection of the KL3 radiation of F (677 eV), Na (1040 eV), and Al (1486 eV), and a PET crystal (2d: 8.742 Å) for P (2010 eV) and S (2309 eV). All lines were recorded over a time of 60 s (peak) and 2 × 30 s (background), respectively. These times were chosen to guarantee sample stability under the electron beam; for observation times of 120 s, we started to see dynamic variations in the Na ion concentration.
The SLCA was carried out three times on the dataset, using conductivity data at three different temperatures (50, 150, and 250°C), so as to obtain information on the possible effect of temperature on gene prevalence. The full dataset is provided in the supplementary Table S2.
Weight Vector Optimization: Vector optimization was conducted using linear fits of the composition-conductivity relationship to obtain weight factors for the contribution of individual elements to ionic conductivity. For an n-dimensional (number of elemental constituents) multivariate dataset X with m observations (number of composition datasets), and another m × 1 dataset y with m observations (number of ionic conductivity datapoints), the SLCA searches for the optimal d × 1 vector w = [w_1, w_2, …, w_d]^T (gene recipe) such that

$$R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - f_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}$$

is maximized (thereby, d is the vector dimensionality, 1 ≤ d ≤ 6). The R² value ranges from 0 to 1, representing poor (R² → 0) and perfect linearity (R² = 1). y_i and ȳ are the ith observation and the mean value of y, respectively. f_i is the predicted value from the best linear fit between the m × 1 vector x = X̃w and the m × 1 vector y, where X̃ is the m × d subset of X considered in the model. x is the weighted sum of the elements in X̃, with weights w,

$$f_i = a x_i + b \quad (6)$$

where a and b are the slope and intercept, respectively, obtained by minimizing the least square error,

$$(a, b) = \arg\min_{a,\,b} \sum_{i=1}^{m} (y_i - a x_i - b)^2 \quad (7a)$$

$$w = \arg\max_{w} R^2(w) \quad (7b)$$

There are two optimization processes in SLCA. One is to minimize the least square error (Equation (7a)) in order to obtain the slope and intercept for the best linear fit, and the other is to obtain the weighting vector w by maximizing the R² value based on Equation (7b). Since w is a d × 1 vector, we have d − 1 fitting parameters (because of the normalization $\left|\sum_{i=1}^{d} w_i\right| = 1$; therefore, |w_1| = 1 when d = 1). The remaining fitting parameters are initialized 1000 times randomly between the lower and upper bounds of −1 and 1, respectively. The two optimization problems form one single model (Figure 5) with w and R² as output. SLCA then explores all possible subsets X̃ of X without requiring any filtering procedure (from prior knowledge) before data input. In the present case, six elements are considered, n = 6. This results in $C_6^1 + C_6^2 + C_6^3 + C_6^4 + C_6^5 + C_6^6 = 6 + 15 + 20 + 15 + 6 + 1 = 63$ possible elemental combinations (with the subscript n and the superscript d, the number of constituents partaking in the specific model; e.g., $C_6^1$ could represent a model correlating all data to the Na concentration alone, whereas $C_6^6$ is the model in which all elements are considered in the fitting process). Each such combination represents one individual model for which the above optimization is done using the full dataset of 56 samples. The weight vector w therefore varies in its dimensionality d, corresponding to the model dimensionality of X̃. In the following, the models are grouped according to their dimensionality, from group I to group VI.
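The count of 63 models and the grouping by dimensionality can be reproduced with a short enumeration. This is a minimal sketch; the function and variable names are illustrative assumptions, while the six element symbols are those of the dataset.

```python
from itertools import combinations

# The six elements considered in the dataset.
ELEMENTS = ["Na", "P", "O", "Al", "F", "S"]

def enumerate_models(elements):
    """Group all non-empty element subsets by their dimensionality d."""
    groups = {}
    for d in range(1, len(elements) + 1):
        groups[d] = list(combinations(elements, d))
    return groups

groups = enumerate_models(ELEMENTS)
counts = [len(groups[d]) for d in sorted(groups)]  # group I to group VI
total = sum(counts)
print(counts, total)
```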
The individual weighting factors w_i in the vector w represent the contribution of element i to the target property, here, ionic conductivity. The sign of w_i is defined by forcing the slope a in Equation (6) to be positive, which is equivalent to requiring monotonically increasing proportionality for an element i with positive w_i, and vice versa.
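For a single model, the twofold optimization can be sketched as follows. Note that this sketch swaps the random-restart search described above for a closed-form shortcut: for a fixed subset, the weighted sum that correlates best with y is proportional to the ordinary least-squares coefficients of a multiple regression with intercept. The function names and the synthetic data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def linear_fit_r2(x, y):
    """Inner problem: best line y ≈ a*x + b and its R²."""
    a, b = np.polyfit(x, y, 1)
    f = a * x + b
    r2 = 1.0 - np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2)
    return a, b, r2

def r2_for_weights(X_sub, w, y):
    """R² of the linear fit between the weighted sum X_sub @ w and y."""
    return linear_fit_r2(X_sub @ w, y)[2]

def slca_closed_form(X_sub, y):
    """Outer problem via OLS (assumed shortcut), normalized so |sum(w)| = 1."""
    A = np.column_stack([X_sub, np.ones(len(y))])  # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    beta = coef[:-1]
    total = beta.sum()
    if np.isclose(total, 0.0):
        raise ValueError("weights sum to zero; normalization is undefined")
    w = beta / abs(total)  # keeps signs, so the fitted slope stays positive
    a, b, r2 = linear_fit_r2(X_sub @ w, y)
    return w, a, b, r2

# Synthetic stand-in for one group III model (assumption): 56 samples, d = 3.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(56, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.01 * rng.normal(size=56)

w_opt, a_opt, b_opt, r2_opt = slca_closed_form(X_demo, y_demo)
print(np.round(w_opt, 3), round(a_opt, 3), round(r2_opt, 5))
```

A random-restart search over w, as described in the text, should converge to the same R² for a given subset; the closed form is used here only to keep the example short and deterministic.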
The weighting factors alone are not suitable for representing genes in terms of (fictive) chemical formulas because they are not given in molar ratios. Therefore, we transform to c instead, with

$$c_i = w_i \bar{X}_i \quad (8)$$

where X̄_i is the mean of the ith chemical element in subset X̃. In this way, genes are represented in the form of c.
Principal Component Analysis: PCA is used to analyze the entire dataset in order to reveal its minimum dimensionality. The principal components v of the m × n dataset X are obtained by solving the eigenvalue equation

$$C v_i = \lambda_i v_i \quad (9)$$

where λ_i is the ith eigenvalue of the corresponding eigenvector v_i (ith principal component). C is an n × n matrix, the covariance matrix:

$$C = \begin{pmatrix} \mathrm{Cov}(X_1, X_1) & \cdots & \mathrm{Cov}(X_1, X_n) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n, X_1) & \cdots & \mathrm{Cov}(X_n, X_n) \end{pmatrix} \quad (10)$$

where Cov(X_i, X_j) is the covariance of two variables X_i and X_j, which are the ith and jth element/column of X:

$$\mathrm{Cov}(X_i, X_j) = E\left[(X_i - \bar{X}_i)(X_j - \bar{X}_j)\right] \quad (11)$$

where E is the expected value operator, and X̄_i and X̄_j are the means of the variables X_i and X_j, respectively. The PCA initially generates the same number of principal components as there are dimensions in the input dataset. It then reduces dimensionality by removing those components which have low λ_i. The remaining components are the principal components; their number is the minimum dimension of the dataset. In effect, PCA conducts a matrix rotation aiming to identify correlations in the raw dataset which would reduce its dimensionality. Before conducting PCA, the raw dataset X requires standardization,

$$Z_i = \frac{X_i - \bar{X}_i}{\sigma_i}$$

where σ_i is the standard deviation of X_i. The standardized Z_i (centered at 0 and scaled by forcing its variation to 1) now replaces X_i in Equations (9)-(11) so that the PCA is not biased toward any chemical element i with high absolute magnitude in X_i. To keep the coefficients in the principal components v orthonormal, the components v_i in v are rescaled by the standard deviation of the corresponding variable,

$$\tilde{v} = D v$$

where D is the n × n diagonal matrix whose n diagonal elements are equal to the inverse of the standard deviations of the n variables/columns X_1, X_2, …, X_n.
Finally, the orthonormal ṽ are obtained (shown in Table 1 as PC1, PC2, …, PC6, ordered by descending eigenvalue, λ_1 > λ_2 > λ_3 > λ_4 > λ_5 > λ_6). For example, the first principal component for our nominal data is (with T representing the transpose)

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.