Principal component analysis and dimensional analysis as materials informatics tools to reduce dimensionality in materials science and engineering



In engineering design, we are constantly faced with the need to describe the behavior of complex engineered systems for which there is no closed-form solution. There is rarely a single multiscale theory or experiment that can meaningfully and accurately capture such information primarily due to the inherently multivariate nature of the variables influencing materials behavior. Seeking structure-property relationships is an accepted paradigm in materials science, yet these relationships are often not linear, and so the challenge is to seek patterns among multiple length and time scales. In this paper, we present two separate but complementary examples of addressing the issue of high-dimensional data in materials science in the spirit of the intellectual focus of this new journal. The first example uses principal component analysis and the second example uses statistical analysis coupled to dimensional analysis. Copyright © 2009 Wiley Periodicals, Inc., A Wiley Company


1. Introduction
The first materials informatics workshop defined the field as “the high speed robust acquisition, management, analysis, and dissemination of diverse materials data”.1 Materials informatics responds to the need for faster development times for new materials and to the unprecedented amount and complexity of materials information resulting from modern modeling and experimental techniques. Traditional techniques of analysis fall short in satisfying these needs, and new approaches are being developed2.

The widely agreed central paradigm of materials science and engineering revolves around understanding the relationships between processing, structure, properties, and performance for each material studied. The enormous complexity in studying materials can be traced to the uncommonly large number of variables involved in the relation between any two of these features. Compounding this challenge, we find that it is impossible in practice to uncover all variables, and the theoretical and experimental limitations of traditional approaches sometimes fail to uncover even the dominant variables. When important variables have not been identified, there are unexplained exceptions to what is believed to be understood, and in some cases, there is a lack of understanding altogether. This is the niche that materials informatics intends to fill. Materials informatics can also make sense of the vast amounts of materials data that are now available. This “knowledge extraction” includes the identification of outliers in the data, the development of models, pattern recognition, and forward and reverse data mapping1.

Materials informatics spans a broad range of tools ranging from systematic, combinatorial experimentation to sophisticated modeling. The modeling efforts can be divided into two main types: “hard modeling” and “soft modeling”.3 Hard modeling encompasses computational strategies involving advanced discretization, parallel algorithms, and a software architecture for distributed computing systems. Among these approaches are atomistic models and ab initio calculations, thermodynamic modeling, phase field simulation, and finite element modeling at a microstructural level. Soft modeling was first introduced by the life sciences and organic chemistry community, and it relates to statistically based, model-independent approaches. Among these approaches are the uses of regressions, neural networks, genetic algorithms, classification algorithms, principal component analysis (PCA), partial least squares, and other data mining techniques.

This paper aims to introduce to the statistical and data mining community two soft modeling approaches, one designed to study the structure-properties aspect of the materials paradigm and the other designed to study the processing-properties aspect. Other techniques introduced in this special issue address relationships involving the other aspects of the materials paradigm.

One of the earliest soft modeling efforts to address the challenge of excessive variables between materials properties was that of Ashby, who showed that by merging phenomenological relationships in materials properties with discrete data on specific materials characteristics, one can begin to develop patterns of classification of materials behavior4. The visualization of multivariate data was done using normalization schemes that permitted the development of “maps”, which provided new means of clustering of materials properties. As an example, one such map is shown in Fig. 1. Ashby's approach also provided a methodology to establish common structure–property relationships across seemingly different classes of materials. This approach is valuable, but it still presents difficulties as a predictive tool because it builds and seeks relationships based on prior models. The “informatics” approach studies materials behavior from a broader perspective. By exploring all types of data, such as crystallographic, electronic, and mechanical data over a wide range of materials, that may have varying degrees of influence on a given property or properties, and with no prior assumptions, we use statistical model estimation, predictive learning, and data mining techniques to establish both classification and predictive assessments of materials behavior5. The innovative aspect of these techniques is that the statistical approaches employed are enhanced by including the basic physics of the problem; for example, requiring that the predictions made have meaningful units.

Figure 1.

An example of data integration in materials engineering mapping correlations between mechanical properties over a wide range of materials: fracture toughness and modulus for metals, alloys, ceramics, glasses, polymers, and metallic glasses. The contours show the toughness Gc in kJ/m2 (from4).

As noted by Searls6, understanding the relative roles of the different attributes governing systems behavior is the foundation for developing models (Fig. 2). Materials design is a process that helps us determine the optimal combinations of material chemistry, processing routes, and processing parameters to robustly meet specific performance requirements, such as mechanical properties and corrosion resistance. This process is iterative by nature due to the incompleteness of the design knowledge base and the lack of one-to-one correspondence in this inverse problem; that is, an effect can be the result of many different causes.

Figure 2.

Understanding complexity in a database structure. (a) Data can be modeled as a set of entities (circles), which are general classes of objects, such as genes and compounds. Associated with entities are attributes (squares), which comprise features or properties such as molecular mass. Attributes take on particular values, and each entity can then correspond to a table in a database, so that the model specifies a schema for that database. (b) When integrating multiple data domains, attributes can collapse on entities that are recognized as identical or even similar, which increases dimensionality in the sense of “arity” or number of attributes, and creates a variety of technical challenges. (c) Integration can also result in recognizing new relationships (arrows) among distinct entities, which increases dimensionality in a different sense, that of connectivity. Increasing connectivity can simply result in a proliferation of data, but at the level of classes of entities in underlying data models, additional connections increase the complexity of those models and resulting database designs (from6). Please refer to the online version for color legends.

Of the two soft modeling approaches presented in this paper, the first (Section 2) is based on principal component analysis and uses hundreds of data points (in this case, crystal chemistries) to seek patterns between crystal structure and chemical properties that would otherwise be difficult to discern. The second approach (Section 3) applies a set of computational strategies to represent processing and property data at a system level within a unit-consistent framework, which also allows for a reliable pruning of secondary effects.


2. Principal Component Analysis
PCA is a projection technique that can be used to handle multivariate data consisting of many interrelated variables obtained from an experiment or from a well-organized database. PCA reduces the dimensionality of the data in a way that minimizes the loss of information: it finds a set of uncorrelated axes, and the hyperplanes spanned by these axes in multidimensional space can be used to visualize the dataset.

PCA relies on the fact that most of the descriptors are intercorrelated, and in some instances these correlations are high7. From a set of N correlated descriptors, we can derive a set of N uncorrelated descriptors [the principal components (PCs)]. Each PC is a suitable linear combination of all the original descriptors. The first PC accounts for the maximum variance (eigenvalue) in the original dataset. The second PC is orthogonal (uncorrelated) to the first and accounts for most of the remaining variance. Thus, the m-th PC is orthogonal to all others and has the m-th largest variance in the set of PCs. Once the N PCs have been calculated using eigenvalue/eigenvector matrix operations, only the M PCs with variances above a critical level are retained. The M-dimensional PC space retains most of the information from the initial N-dimensional descriptor space by projecting it onto orthogonal axes of high variance. The complex tasks of classification or prediction are made easier in this compressed space.

PCA Methodology

Following the treatment of Johnson and Wichern8, we can describe this mathematically as follows. Consider p random variables X1, X2, …, Xp. The original system can be rotated and a new coordinate system obtained with the new axes representing directions with maximum variability. The new axes, which are linear combinations of the original variables, are the PCs. Let Σ be the covariance matrix associated with the random vector X′ = [X1, X2, …, Xp]. The corresponding eigenvalue–eigenvector pairs of Σ are (λ1, e1), (λ2, e2), …, (λp, ep), where λ1 ≥ λ2 ≥ … ≥ λp ≥ 0. Then, the i-th PC is given by

Yi = ei′X = ei1X1 + ei2X2 + … + eipXp,  i = 1, 2, …, p    (1)

Then, the covariance and variance are given by

Cov(Yi, Yk) = ei′Σek = 0,  i ≠ k    (2)
Var(Yi) = ei′Σei = λi,  i = 1, 2, …, p    (3)

Thus, the PCs are uncorrelated from Eq. (2) and have variances equal to the eigenvalues of Σ from Eq. (3). The trace of the covariance matrix can be expressed as:

σ11 + σ22 + … + σpp = λ1 + λ2 + … + λp    (4)

where σii is a diagonal element of Σ.

Then the proportion of total population variance due to the k-th PC is

λk/(λ1 + λ2 + … + λp),  k = 1, 2, …, p    (5)

Consequently, if most of the total population variance, for large p, can be attributed to the first two or three components, these can replace the original variables with a minimal loss of information. When the variables have different ranges and are measured on different scales (as is the case with most materials problems), they are standardized.

Z = (V^(1/2))^(−1)(X − µ)    (6)

Here, Z is the vector of standardized variables, V^(1/2) is the diagonal standard deviation matrix whose entries are √σii (σii being the variance of the i-th variable), and µ is the vector of population means. Then cov(Z) = ρ and var(Zi) = 1, and the PCs of Z are obtained from the eigenvectors of the correlation matrix ρ of X. The corresponding eigenvalue–eigenvector pairs of ρ are (λ1, e1), (λ2, e2), …, (λp, ep), where λ1 ≥ λ2 ≥ … ≥ λp ≥ 0. The i-th PC is given by

Yi = ei′Z = ei′(V^(1/2))^(−1)(X − µ),  i = 1, 2, …, p    (7)

The proportion of total population variance due to the k-th PC is λk/p, where k = 1, 2, …, p and the λk's are the eigenvalues of ρ.

In short, PCA reduces the redundancy contained within the data by creating a new series of components in which the axes of the new coordinate systems point in the direction of decreasing variance. Geometrical explanations of PCA are available from the literature9.
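As a concrete illustration of the procedure above, the following NumPy sketch standardizes a synthetic descriptor matrix (the data are purely illustrative, not the materials data discussed below), computes the eigenvalue–eigenvector pairs of the correlation matrix ρ, and reports the proportion of variance captured by each PC:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic descriptor matrix: 200 observations of p = 4 intercorrelated
# descriptors measured on very different scales, as is typical of materials data.
t = rng.normal(size=200)
X = np.column_stack([
    t + 0.1 * rng.normal(size=200),         # descriptor 1
    1e3 * t + 50.0 * rng.normal(size=200),  # descriptor 2, different units
    rng.normal(size=200),                   # descriptor 3, uncorrelated
    -t + 0.1 * rng.normal(size=200),        # descriptor 4, anti-correlated
])

# Standardize each variable (subtract the mean, divide by the standard deviation).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalue-eigenvector pairs of the correlation matrix rho of X,
# sorted so that lambda_1 >= lambda_2 >= ... >= lambda_p.
rho = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(rho)  # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: project the standardized data onto the eigenvectors (the PCs).
Y = Z @ eigvecs

# Proportion of total population variance per PC: lambda_k / p.
proportions = eigvals / eigvals.sum()
print(proportions)
```

Since the trace of ρ equals p, the proportions sum to one; the PCs in `Y` come out mutually uncorrelated, and in this example the first component captures most of the variance because three of the four descriptors track the same latent variable.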

Application of PCA

A fundamental challenge in materials science is the discovery of new materials. One prominent example is the discovery of high temperature superconductors. Since the discovery of high temperature superconductors two decades ago, there has been a significant revival of the question: how do we discover the next high temperature superconductor? A fundamental question is whether one can identify patterns of behavior that might provide clues as to which factors might govern this important property. Figure 3 shows a three-dimensional PCA plot of a multivariate database of high temperature superconductors. The initial dataset consisted of 600 compounds with data on five different attributes or variables associated with each compound. The data for high temperature superconductors were collected from the literature10–26. The PCA results shown in Fig. 3 indicate that no clear pattern emerges (from visual inspection at least) unless one looks at the PC2–PC3 projection, where a clear clustering pattern is seen (right plot of Fig. 4).

Figure 3.

PCA results with superconducting data. By inspection of the original dataset, the linear clusters in the PC2–PC3 projection were found to be associated with systematic valency changes among the compounds studied. It should be noted that data association has to be a deliberate and careful process of understanding which data were input and how to interpret the resulting patterns. Automatic methods of unsupervised data interpretation are possible, and that is where data mining methods become very valuable27. As explained in the PCA methodology section, the percentages along each axis represent the variance of the data captured by each PC. Thus, three PCs contain 80.98% of the variance of the data. Please refer to the online version for color legends.

Figure 4.

(left) Bivariate plots of Matthias profile of Tc (transition temperature) versus valency (Nv) for three different superconducting regions (A, B, and C) suggested by Villars and Phillips28. Please note that a superconductor is a material that will conduct electricity without resistance below a certain temperature, the transition temperature (Tc). (right) PC2–PC3 score plot. Comparison of bivariate data plots showing linear clustering of superconducting transition temperature with valency changes (Nv) with a multivariate data plot PC projection showing the same trends. Thus, many high Tc superconductors generally sit in the middle range of valency. Every data point in the PCA plot represents a compound with all the attributes embedded. Please refer to the online version for color legends.

All the data points refer to different compounds, and their spatial position indicates how they relate to each other when defined simultaneously by all the latent variables or parameters that may be used to characterize or influence their superconducting behavior. It can be seen that the interpretation of this three-dimensional projection of information of a five-dimensional dataset (i.e. five variables or descriptors for each compound) becomes more meaningful by examining one of the two projections of this eigenspace. By systematic association of trends in transition temperature with the original datasets, it was discovered that each linear cluster was associated with a given “average valency”—a term proposed by Villars and Phillips nearly two decades ago based on the Matthias profile (left plots in Fig. 4)28.

In other words, all the compounds in the score plot (right plot in Fig. 4) cluster according to the “average valency” hypothesis. Given that the data used by Villars and Phillips just precede the discovery of ceramic superconductors, our work, which includes the data since then, demonstrates the broader impact of the valency clustering criterion in superconductors. Figure 4 also helps to provide a physical interpretation of trajectories in PCA space; for instance, PC2 and PC3 represent orthogonal projections of the “valency vector” effect on high Tc superconductors, as it is approximately at 45° to the PC axes. While the PC axes themselves in this case do not have a discrete physical meaning, other directions in the PC projections can. We will later show examples of how the PC axes can correlate directly to trends in a physical parameter.

The loading plot (Fig. 5) suggests that the cohesive energy does not play a major role in discriminating among compounds in terms of the linear clustering projection, as it is near the origin (0, 0) position of the loading plot. However, parameters such as pseudopotential radii are a dominant factor, whereas valency and ionization energy play an important but lesser role. The loading plot also shows the noticeable negative correlation between ionization energy and pseudopotential radii in terms of their influence on average valency clustering of the compounds as they reside in opposite quadrants.

Figure 5.

Loading plot for high Tc superconducting dataset. Every point is an attribute, and their spatial correlations suggest the strength of the relative correlations between attributes on the compound dataset. As a quantitative comparison to the scoring plot above, the PC equations as derived from the eigenvalue analysis yielded: PC2 = 0.89Nv + 0.01X − 0.40R + 0.03C + 0.22I and PC3 = 0.41Nv + 0.41X + 0.76R − 0.06C − 0.30I. Note that for PC3, the weighting coefficients for valency and electronegativity are the same (0.41), as shown on the graph above. Also, the weighting coefficients for cohesive energy are very small (0.03 and −0.06) for both PCs, suggesting that this attribute plays a very small role relative to the other attributes or latent variables.

As shown in Fig. 6, we can further gather more information by juxtaposing information from both scoring and loading plots and by using different visualization schemes.

Figure 6.

Interpreting results of a scoring plot. The dashed curve shows the trend in transition temperature across the linear clusters. The peak corresponds to those with the highest recorded temperatures. By comparison to the earlier scoring plot, and using a different visualization scheme, we can now capture a more complete perspective of trends in this multivariate dataset. It is interesting to note that the more recent discovery of MgB2 as a high temperature superconductor actually shows up in this plot indicating how this multidimensional analysis appears to capture the critical physics governing high Tc materials. Please refer to the online version for color legends.

The strong effect of valency on the linear pattern of clustering in the scoring plot is consistent with its large distance from the origin on the loading plot. In the above figure, we have presented the same scoring plot as before, except that now each compound is labeled according to structure type rather than transition temperature. The compounds with the highest transition temperatures are the cupric oxides (marked in light orange).

To summarize, when we start with a multivariate data matrix, PCA permits us to reduce the dimensionality of that dataset. This reduction in dimensionality now offers us better opportunities to:

  • Identify the strongest patterns in the data.

  • Capture most of the variability of the data with a small fraction of the total set of dimensions.

  • Eliminate most of the noise in the data, making it beneficial for both data mining and other data analysis algorithms.
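The three points above can be illustrated with a short sketch on synthetic data (assuming one dominant latent variable; this is not the superconductor dataset): when a few components capture most of the variance, reconstructing the data from only those components discards mostly noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy observations of a single latent signal spread across five
# correlated descriptors (illustrative data only).
latent = rng.normal(size=(300, 1))
weights = np.array([[1.0, 0.8, -0.6, 0.5, -0.9]])
X = latent @ weights + 0.1 * rng.normal(size=(300, 5))

# SVD of the centered data gives the PCs directly.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance captured by each component.
var_explained = s**2 / np.sum(s**2)

# Rank-1 reconstruction: keep only the first PC.
X1 = (U[:, :1] * s[:1]) @ Vt[:1, :]

print(var_explained[0])                              # dominant pattern
print(np.linalg.norm(Xc - X1) / np.linalg.norm(Xc))  # small residual: mostly noise
```

Here the first component explains well over 90% of the variance, and the residual left behind by the rank-1 reconstruction is essentially the added noise.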


3. Dimensional Analysis
Here we present an algorithm that combines a linear regression model of the experimental data with physical considerations of the process; namely, the units of the resulting model match the units of the dependent variable. We look for the power law model that minimizes the prediction error only among models that have the correct units. The output of the algorithm is a physically meaningful and simple power law representing the process, together with a set of dimensionless groups ordered by their relevance to the problem. The user input in selecting the simple model, and the ability to correct it further using the dimensionless groups, provide the means to construct a model that achieves the desired balance between accuracy and simplicity.

Dimensional analysis (DA) was formalized by Buckingham29 with his “Pi theorem”. DA is a standard tool used by many engineering and science disciplines. In essence, DA reduces the number of parameters in a problem by considering their units. For example, many fluid mechanics problems involve four parameters (viscosity, density, length, and velocity), which, by using this technique, can be reduced to a single dimensionless parameter (the Reynolds number). Thus, DA can significantly reduce the number of experiments necessary to characterize a system.
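The Reynolds-number reduction just described can be reproduced mechanically: a product of the parameters is dimensionless exactly when its exponent vector lies in the nullspace of the matrix of unit exponents (Buckingham's Pi theorem in linear-algebra form). A minimal NumPy sketch:

```python
import numpy as np

# Dimension matrix for the fluid mechanics example.
# Columns: density rho, velocity V, length L, viscosity mu.
# Rows: exponent of each reference unit (kg, m, s) in each parameter.
#              rho    V     L    mu
D = np.array([[ 1.0,  0.0,  0.0,  1.0],   # kg
              [-3.0,  1.0,  1.0, -1.0],   # m
              [ 0.0, -1.0,  0.0, -1.0]])  # s

# rho^a V^b L^c mu^d is dimensionless iff D @ [a, b, c, d] = 0,
# i.e. the exponent vector lies in the nullspace of D.
_, _, Vt = np.linalg.svd(D)
null = Vt[-1]                # rank 3, 4 parameters -> one-dimensional nullspace
exponents = null / null[0]   # normalize so the density exponent is +1

print(exponents)  # approximately [1, 1, 1, -1]: rho V L / mu, the Reynolds number
```

Four parameters minus three independent units leaves exactly one dimensionless group, recovering ρVL/µ up to an arbitrary power.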

While the application of DA to relatively simple systems is well understood, when a problem involves many parameters, DA typically yields too many dimensionless groups and so does not enhance understanding or intuitive interpretation. This excess of dimensionless groups is more often the rule than the exception when modeling structure–property relationships in materials or materials processing operations. The difficulty of implementing DA in materials science and engineering is the main reason for its relatively limited use in these fields. We will use the materials informatics approach to select the most representative dimensionless groups in order to reduce the dimensionality of the problem.

Previous efforts to use DA to reduce the number of adjustable parameters in regressions were pioneered by Li and Lee30, Dovi et al.31, and Vignaux32, 33. The limitations of using regressions based on dimensionless groups have been discussed by Hicks34 and Kenney35. The Artificial Intelligence community has also provided interesting results from statistical data analysis to “rediscover” governing laws. Important examples of this research are the algorithm BACON36, algorithms ABACUS37 and COPER38, and recent work by Washio and Motoda37, 39.

The materials informatics approach that we are pursuing combines elements of DA with elements of regressions and statistics. This approach differs from previous statistical approaches in four main aspects. First, the dimensionless groups employed are generated by an algorithm instead of being postulated a priori; second, it does not require integer exponents in the scaling laws; third, it allows for datasets in which variables change value simultaneously; and fourth, it explicitly searches for the simplest predictive formulation using heuristics. DA generates results in the form of power laws. Power laws yield estimates in the form of a function of the problem parameters raised to constant exponents. For example, if L is a characteristic value of length in the x direction, L^a is a power law, while x^a is not.

Power laws are ubiquitous in engineering and science and are especially appealing to materials informatics because they can provide estimations for a whole family of systems. A power law developed for a particular materials system is also applicable to all other materials that respond to the same dominant physics. This way, outliers can be readily identified indicating either errors or physical phenomena that had been disregarded but were relevant for that outlier. Another convenient feature of power laws is the simplicity of reverse mapping. Because of their mathematical form, scaling laws are especially good for capturing nonlinear phenomena40.

We have implemented the combined approach of DA, regression analysis, and dimensionality reduction into an algorithm we called SLAW (for Scaling LAW)41. SLAW is an algorithm designed to generate power laws from statistical data. SLAW focuses on problems where many parameters are present and a dominant subset of parameters must be identified. To obtain the resulting power law, SLAW performs a sequence of multivariate linear regressions based on the logarithm of the parameters and target quantity.

SLAW also uses heuristic considerations or assumptions, that is, phenomenologically based correlations expected from known structure–property relationships in materials science. The first assumption is that the target quantity can be captured by a power law. As discussed earlier, this is generally a good hypothesis; it also requires that the data being analyzed correspond to the same dominant physics. The second assumption is that the optimal dimensionality reduction can be identified by a sudden jump in error during the elimination process. A third assumption is that the exponents of the resulting power laws can be rounded to the nearest ratio of small integers. This reduces the effect of experimental noise in the results and allows the rounded model to reproduce trends of the closed-form exact solution.
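The third assumption can be sketched with the Python standard library: `Fraction.limit_denominator` rounds a fitted exponent to the nearest ratio of small integers. The exponent values and the maximum denominator of 4 below are illustrative choices, not SLAW's actual rounding rule.

```python
from fractions import Fraction

def round_exponent(x, max_den=4):
    """Round a fitted exponent to the nearest ratio of small integers
    (denominator at most max_den)."""
    return Fraction(x).limit_denominator(max_den)

# Noisy regression exponents; rounding recovers a clean scaling law
# and suppresses the effect of experimental noise.
fitted = [0.97, 0.52, -1.48, 0.02]
rounded = [round_exponent(x) for x in fitted]
print(rounded)  # 1, 1/2, -3/2, and 0 as exact fractions
```

An exponent rounded to 0 drops its parameter from the power law entirely, which is how rounding and the pruning of secondary effects interact.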

SLAW Methodology

The SLAW algorithm is based on backwards elimination of a constrained linear regression in the logarithmic space. If Y is the target magnitude that we want to model, DA states that it can be represented exactly as:

Y = X1^(a01) X2^(a02) ⋯ Xn^(a0n) f(Π1, Π2, …, Πm)    (8)

where the Xj are the parameters that completely describe the system, the a0j are real constants, f is a function of the dimensionless groups, the Πi are dimensionless groups based on the parameters, n is the number of parameters, and m = n − k, with k being the number of independent units involved in the variables (e.g. m, kg, s, etc.). The dimensionless groups have the expression

Πi = X1^(ai1) X2^(ai2) ⋯ Xn^(ain)    (9)

If we approximate function f as a power law of the dimensionless groups, the target quantity Y can then be represented as

Y = b0 X1^(β1) X2^(β2) ⋯ Xn^(βn)    (10)

The constrained linear regression is performed in the logarithmic space

ln Y = β0 + β1 ln X1 + β2 ln X2 + … + βn ln Xn + ε    (11)

where β0 = ln b0 and ε captures the uncertainties of the representation. Additive errors in the logarithmic space are a common assumption in fitting scaling laws30, 31, justified by Benford's law42, 43, which states that variations of physical quantities are evenly distributed on a logarithmic scale.

Considering p experimental observations of the physical process, we obtain estimators of the βj coefficients using standard linear regression techniques. We denote the p observations of the target magnitude Y by y1, …, yp, and the observations of the j-th independent variable Xj by x1j, …, xpj. We assume independent experimental observations, which implies that the observed errors ε1, …, εp are independent identically distributed (IID) random variables. Using matrix notation, we have

y = Xβ + ε    (12)

where y = [ln y1, …, ln yp]′, β = [β0, β1, …, βn]′, ε = [ε1, …, εp]′, and X is the p × (n + 1) matrix whose i-th row is [1, ln xi1, …, ln xin].

The estimate for the coefficients of the model that minimizes the residual sum of squares is the solution to the system of normal equations X′Xβ = X′y. We denote this estimate by the (n + 1)-dimensional vector β̂, and the estimate of the target magnitude becomes

Ŷ = exp(β̂0) X1^(β̂1) X2^(β̂2) ⋯ Xn^(β̂n)    (13)

The estimate β̂, however, will generally not satisfy the units constraint. In the logarithmic space, dimensional homogeneity becomes a linear constraint of the form Rβ = b, where R is a q × (n + 1) matrix such that Rij is the exponent of reference unit i in the units of variable Xj, for j = 0, 1, …, n, and b is a q-dimensional vector such that bi is the exponent of reference unit i in the units of the dependent variable Y.

Therefore, the best power law that is dimensionally consistent corresponds to the coefficients that minimize the residual sum of squares while satisfying the units constraint:

β* = arg min {(y − Xβ)′(y − Xβ) : Rβ = b}    (14)

To represent the units constraint in a linear form, assume that q reference units (e.g. m, kg, s, etc.) are the building blocks for the units of the dependent and all independent variables in the problem. Further details of the mathematical implementation of SLAW are included in41, and a prototype algorithm can be downloaded from44.

Algorithm SLAW can be broken down into four steps:

  1. Find the sequence of models {βk} through a backward elimination process that solves the constrained optimization problem.
  2. Determine (with user input) which model of the sequence {βk} to select; let us call it β̂*.
  3. Round the coefficients in β̂*, obtaining a physically meaningful simple model β*.
  4. Perform backward elimination again to identify dimensionless groups in what is not explained by β*.

Step 4 is needed to identify the correct dimensionless groups, because the rounding procedure in step 3 creates a model that is slightly different from the model derived in the original regressions.
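The constrained minimization at the heart of this procedure, least squares in log space subject to the linear units constraint Rβ = b, can be solved with Lagrange multipliers via the KKT system. The sketch below is not the SLAW implementation: the data are synthetic, and the "true" law U = 0.5 σ^2 E^(−1) r^3 and the choice of Pa and m as reference units are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic example: an energy-like target U [Pa m^3] generated from a
# hypothetical law U = 0.5 * sigma^2 * E^-1 * r^3 with multiplicative noise.
p = 60
sigma = 10 ** rng.uniform(7, 9, p)    # stress [Pa]
E = 10 ** rng.uniform(10, 12, p)      # modulus [Pa]
r = 10 ** rng.uniform(-3, -1, p)      # radius [m]
U = 0.5 * sigma**2 / E * r**3 * np.exp(0.05 * rng.normal(size=p))

# Log-space design matrix: columns [1, ln sigma, ln E, ln r].
A = np.column_stack([np.ones(p), np.log(sigma), np.log(E), np.log(r)])
y = np.log(U)

# Units constraint R beta = b with reference units Pa and m:
# the Pa exponents must sum to 1 and the m exponents to 3;
# the constant beta_0 is dimensionless (zero column).
R = np.array([[0.0, 1.0, 1.0, 0.0],   # Pa
              [0.0, 0.0, 0.0, 1.0]])  # m
b = np.array([1.0, 3.0])

# KKT system of the equality-constrained least squares problem:
# minimize ||y - A beta||^2 subject to R beta = b.
n1, q = A.shape[1], R.shape[0]
KKT = np.block([[2.0 * A.T @ A, R.T],
                [R, np.zeros((q, q))]])
rhs = np.concatenate([2.0 * A.T @ y, b])
beta = np.linalg.solve(KKT, rhs)[:n1]

print(beta[1:])  # exponents close to the true values [2, -1, 3]
```

The recovered exponents satisfy the units constraint exactly; backward elimination would repeat a solve of this kind while successively removing parameters.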

The systematic reduction of degrees of freedom in SLAW helps avoid the problem of “overfitting”, in which the regressions reproduce the input data but perform poorly on new data45–47. SLAW shares some similarities with PCA48 in its goal of discriminating dominant variables from secondary ones. A significant difference between SLAW and PCA is that the former simplifies the system by eliminating physical parameters, while in PCA each component can potentially involve all parameters. Thus, the simplifications from SLAW are not orthogonal in the sense of PCA; however, the advantage of SLAW's approach is that by eliminating physical parameters, less information is necessary to make estimations. Once the dominant set of parameters has been established, estimations can be performed with very coarse knowledge of the secondary parameters, enabling the use of incomplete datasets. A minimum knowledge of the secondary parameters is still necessary in SLAW to ensure that the intended estimations correspond to the proper regime.

A good example of the implementation of SLAW in the field of materials science is the analysis of residual strains in ceramic to metal bonding, which has already been studied with conventional methods49. In this problem, the goal is to find a general expression to estimate the total amount of residual elastic energy in the ceramic, which is a good metric of performance of the joint. In this case, the target quantity is the total elastic strain energy in the ceramic U (Pa m3), and the independent variables are the elastic modulus of the ceramic Ec (Pa), the elastic modulus of the metal Em (Pa), the yield stress of the metal σY (Pa), the radius of the face of the ceramic and metallic cylinders being joined r (m), and the differential thermal shrinkage between ceramic and metal εT (dimensionless).

The sequence of models and corresponding errors at each step of SLAW are illustrated in Fig. 7. In this figure, it is clear how the systematic elimination of independent variables has a relatively small effect on the error of the model until iteration 6, at which point the error increases dramatically. Based on this behavior, the model of iteration 5 is selected and its exponents rounded. Step 4 of SLAW indicates that including the numerical constant is the best enhancement to the rounded model, obtaining the following expression:

equation image(15)
Figure 7.

Evolution of number of parameters considered and corresponding error for backwards elimination in the application of SLAW to the ceramic to metal joining example.

In Fig. 8, this expression is compared with the results obtained from finite element analysis using ABAQUS, showing very good correspondence, especially considering the simplicity of the estimates.

Figure 8.

Comparison between calculated strain energy using finite element analysis (horizontal axis) and estimated values (vertical axis). Each point represents a different ceramic to metal joint. Please refer to the online version for color legends.


4. Conclusions
Seeking processing-structure-property-performance relationships is the accepted paradigm in materials science and engineering, and yet these relationships are often not linear. The challenge is to find a way to link materials behavior and seek patterns among multiple length and time scales. Data mining and statistical techniques permit us to survey complex, multiscale information in an accelerated and statistically robust and yet physically meaningful manner. When this is coupled to advanced tools for computational thermodynamics, kinetic simulations, and massive systematic experimentation, we have a powerful computational infrastructure for materials design.


Acknowledgments
KR and CS gratefully acknowledge support from the National Science Foundation International Materials Institute program for Combinatorial Sciences and Materials Informatics Collaboratory (CoSMIC-IMI)—Grant no. DMR-08-33853; AFOSR Grants no. FA95500610501 and no. FA95500810316 and the DARPA Center for Interfacial Engineering for MEMS (CIEMS)—Grant no. 18918740367 90B. PFM acknowledges support from NSF Career Award DMI-0547649.