## INTRODUCTION

The first materials informatics workshop defined the field as “the high speed robust acquisition, management, analysis, and dissemination of diverse materials data”.1 Materials informatics responds to the need for faster development times for new materials and to the unprecedented amount and complexity of materials information resulting from modern modeling and experimental techniques. Traditional techniques of analysis fall short of satisfying these needs, and new approaches are being developed.2

The widely agreed central paradigm of materials science and engineering revolves around understanding the relationships between processing, structure, properties, and performance for each material studied. The enormous complexity of studying materials can be traced to the uncommonly large number of variables involved in the relation between any two of these features. Compounding this challenge, it is impossible in practice to uncover *all* variables, and the theoretical and experimental limitations of traditional approaches sometimes fail to uncover even the dominant ones. When important variables have not been identified, there are unexplained exceptions to what is believed to be understood, and in some cases there is a lack of understanding altogether. This is the niche that materials informatics intends to fill. Materials informatics can also make sense of the vast amounts of materials data now available. This “knowledge extraction” includes the identification of outliers in the data, the development of models, pattern recognition, and forward and reverse data mapping.1

Materials informatics spans a broad range of tools, from systematic, combinatorial experimentation to sophisticated modeling. The modeling efforts can be divided into two main types: “hard modeling” and “soft modeling”.3 Hard modeling encompasses computational strategies involving advanced discretization, parallel algorithms, and software architectures for distributed computing systems. Among these approaches are atomistic models and *ab initio* calculations, thermodynamic modeling, phase field simulation, and finite element modeling at the microstructural level. Soft modeling was first introduced by the life sciences and organic chemistry community, and it relates to statistically based, model-independent approaches. Among these approaches are regressions, neural networks, genetic algorithms, classification algorithms, principal component analysis (PCA), partial least squares, and other data mining techniques.
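The flavor of a soft modeling approach can be conveyed with a minimal regression sketch. The processing variables, property values, and fitted coefficients below are hypothetical and purely illustrative, not data from this paper; the point is only that a model-independent fit maps processing inputs to a measured property without assuming an underlying physical model.

```python
import numpy as np

# Hypothetical processing-property records: annealing temperature (K)
# and annealing time (h) versus a measured hardness (arbitrary units).
# All values are illustrative, not experimental data.
temp = np.array([600.0, 650.0, 700.0, 750.0, 800.0])
time = np.array([1.0, 2.0, 1.5, 3.0, 2.5])
hardness = np.array([210.0, 198.0, 185.0, 160.0, 150.0])

# Design matrix with an intercept column; fit by ordinary least squares.
A = np.column_stack([np.ones_like(temp), temp, time])
coef, *_ = np.linalg.lstsq(A, hardness, rcond=None)

# Predict the property for a new, unseen processing condition.
pred = coef @ np.array([1.0, 720.0, 2.0])
print(coef, pred)
```

In practice the same pattern scales up: more processing variables, nonlinear learners such as neural networks in place of the linear fit, and cross-validation to guard against overfitting.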

This paper aims to introduce to the statistical and data mining community two soft modeling approaches, one designed to study the structure-properties aspect of the materials paradigm and the other designed to study the processing-properties aspect. Other techniques introduced in this special issue address relationships involving the other aspects of the materials paradigm.

One of the earliest soft modeling efforts to address the challenge of excessive variables between materials properties was that of Ashby, who showed that by merging phenomenological relationships in materials properties with discrete data on specific materials characteristics, one can begin to develop patterns of classification of materials behavior.4 Multivariate data were visualized using normalization schemes that permitted the development of “maps”, which provided new means of clustering materials properties. One such map is shown in Fig. 1. Ashby's approach also provided a methodology to establish common structure-property relationships across seemingly different classes of materials. This approach is valuable, but it still presents difficulties as a predictive tool because it builds and seeks relationships based on prior models. The “informatics” approach to studying materials behavior takes a broader perspective. By exploring all types of data (crystallographic, electronic, and mechanical, for example) over a wide range of materials, each of which may influence a given property to varying degrees, and with no prior assumptions, we use statistical model estimation, predictive learning, and data mining techniques to establish both classification and predictive assessments of materials behavior.5 The innovative aspect of these techniques is that the statistical approaches employed are enhanced by including the basic physics of the problem; for example, by requiring that the predictions made have meaningful units.

As noted by Searls,6 understanding the relative roles of the different attributes governing system behavior is the foundation for developing models (Fig. 2). Materials design is a process that helps us determine the optimal combinations of material chemistry, processing routes, and processing parameters to robustly meet specific performance requirements, such as mechanical properties and corrosion resistance. This process is iterative by nature, owing to the incompleteness of the design knowledge base and the lack of one-to-one correspondence in this inverse problem; that is, an effect can be the result of many different causes.

Of the two soft modeling approaches presented in this paper, the first, described in Section 2, is based on principal component analysis and uses hundreds of data points (in this case, crystal chemistries) to seek patterns between crystal structure and chemical properties that would otherwise be difficult to uncover. The second, described in Section 3, applies a set of computational strategies to represent processing and property data at the system level within a unit-consistent framework, which also allows for reliable pruning of secondary effects.
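The kind of principal component analysis used in the first approach can be sketched in a few lines. The descriptor table below is hypothetical (illustrative compounds with made-up electronegativity differences, radius ratios, and valence electron counts, not the crystal-chemistry data of Section 2); the sketch shows only the generic mechanics of centering, scaling, and projecting onto principal components.

```python
import numpy as np

# Hypothetical descriptor table: each row is a compound, each column a
# feature (electronegativity difference, atomic radius ratio, valence
# electron count). Values are illustrative, not real crystal-chemistry data.
X = np.array([
    [1.2, 0.85, 8.0],
    [0.9, 0.78, 7.0],
    [2.1, 0.60, 10.0],
    [1.9, 0.63, 9.0],
    [0.4, 0.95, 4.0],
    [0.5, 0.92, 5.0],
])

# Center and scale each feature so that no single unit dominates.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Principal components via the singular value decomposition.
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
scores = Xs @ Vt.T               # coordinates of each compound in PC space
explained = s**2 / np.sum(s**2)  # fraction of variance per component

print(explained)
```

Plotting the first two columns of `scores` against each other is what produces the clustering maps in which chemically similar compounds group together, the pattern-seeking step that Section 2 applies to hundreds of crystal chemistries.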