“Big Data” (BD) are transforming commerce and public policy. BD are large, complex collections of data not readily manageable with common tools that, according to Hampton and colleagues (2013), present unprecedented opportunities for advancing science and informing resource management. Given the importance of life cycle assessment (LCA) in understanding resource management in existing and emerging technology systems, it is valuable to consider the role of, and challenges for, BD in LCA, and how we can work together to wield their collective power.
Big Data have been discovered, accessed, sampled, and used in the development of LCAs, and sampled data are explicitly and implicitly disseminated in open-access LCA data repositories.
For some time, the volume and complexity of BD in LCA have been managed using descriptive or inferential statistics. For example, BD representing ambient temperature and precipitation point measurements collected by the U.S. National Climatic Data Center1 and sampled to represent conditions at aggregated geographic levels (e.g., state, country, and hydrological regions) have been used in the fate and transport component of characterization factors in life cycle impact assessment and for fuel combustion volatilization estimates in life cycle inventories (LCIs). Also, BD collected by the U.S. Department of Agriculture (USDA) National Agricultural Statistics Service are sampled to prepare the Agricultural Resource Management Survey2 (ARMS) and have been used to provide estimates of land use, seed use, irrigation, tillage, crop residue management, and the use of nutrients, manures, and pesticides. Likewise, BD collected by the U.S. Environmental Protection Agency (US EPA) and the U.S. Energy Information Administration are sampled to prepare the Emissions and Generation Resource Integrated Database3 (eGRID), which provides resource mix, heat input, and select air emissions data for U.S. electric power generation.
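The aggregation step described above can be sketched in a few lines of code. The following Python snippet, with invented station readings (the region codes and values are illustrative, not NCDC data), collapses point measurements into regional means and measures of dispersion:

```python
from statistics import mean, stdev

# Illustrative point measurements: (region, ambient temperature in degrees C).
# These values are invented for demonstration, not NCDC data.
readings = [
    ("WA", 11.2), ("WA", 9.8), ("WA", 10.5),
    ("TX", 21.4), ("TX", 23.1), ("TX", 22.0),
]

def aggregate_by_region(data):
    """Collapse point measurements into per-region mean, dispersion, and count."""
    by_region = {}
    for region, value in data:
        by_region.setdefault(region, []).append(value)
    return {
        region: {"mean": mean(values), "stdev": stdev(values), "n": len(values)}
        for region, values in by_region.items()
    }

summary = aggregate_by_region(readings)  # e.g., summary["WA"]["mean"]
```

Keeping the dispersion (here, the standard deviation and sample size) alongside each regional mean is what allows a sampled estimate to carry its variability into an LCA rather than entering as a bare point value.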
Although these examples were easy to identify, the use of BD in LCA is far from comprehensive: far more estimates are prepared with “small data” (sometimes point estimates) and models than are sampled from BD. Combining BD with point estimates in LCA reveals a troubling mismatch: flows sampled from BD can, and should, be accompanied by measures of statistical dispersion, yet in an LCA uncertainty analysis they may appear to be a troublesome source of variability when, in fact, they provide a better representation of the reality LCA is intended to portray.
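To illustrate the mismatch, consider a minimal Monte Carlo sketch (all values invented): one flow is sampled from BD and carries a dispersion measure, while a second flow is a small-data point estimate with none.

```python
import random

random.seed(0)  # reproducible illustration

# Hypothetical inventory flows (values invented for demonstration):
bd_flow_mean, bd_flow_stdev = 10.0, 2.0  # BD-sampled flow with dispersion
point_flow = 5.0                         # small-data point estimate, no dispersion

def monte_carlo_total(n_runs=10_000):
    """Propagate uncertainty: only the BD-sampled flow varies across runs."""
    return [random.gauss(bd_flow_mean, bd_flow_stdev) + point_flow
            for _ in range(n_runs)]

totals = monte_carlo_total()
# The spread in `totals` comes entirely from the BD-sampled flow; in an
# uncertainty analysis it can look like troublesome variability, although
# it reflects real-world variation that the point estimate simply hides.
```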
Also, BD sampling for LCA can result in substantial sampling error (see Cooper et al. 2013), particularly when data/assessment bounds are set at national or global levels. For some time, many LCA practitioners, notably those developing and using impact characterization factors, have called for increased site specificity. Sampling BD for smaller geographic areas should provide yet another improved representation of the reality LCA is intended to portray. However, a fine spatial resolution accompanied by significant sample sizes has historically been impractical because of computational limitations. With today's emerging tools for regionalized LCA (e.g., OpenLCA), new insights will be possible through further exploitation of BD.
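The tension between spatial resolution and sampling error can be seen directly in the standard error of the mean, s/√n: for the same data source, fewer samples per region means a larger standard error. A small sketch, with invented measurement values:

```python
from math import sqrt
from statistics import stdev

def standard_error(values):
    """Standard error of the mean: s / sqrt(n)."""
    return stdev(values) / sqrt(len(values))

# Invented illustration: a national pool of 8 measurements versus a
# regional subset of 3 drawn from the same pool.
national = [10.0, 12.0, 9.0, 11.0, 10.5, 11.5, 9.5, 12.5]
regional = national[:3]

# Fewer samples per region -> larger standard error, i.e., more sampling
# error at finer spatial resolution for the same underlying data source.
finer_is_noisier = standard_error(regional) > standard_error(national)
```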
Given these reflections, the emergence of open-access LCA repositories provides yet another piece of the BD puzzle by harnessing the power of crowd-sourced information. Consider the three LCA data repositories depicted in figure 1: the USDA's LCA Digital Commons4 (USDA LCADC); the U.S. Department of Energy National Renewable Energy Laboratory's Life Cycle Inventory Database5 (NREL LCI DB); and the European Commission Joint Research Centre's European reference Life Cycle Database6 (JRC ELCD). These repositories are designed to accept and disseminate all the types of data needed to develop an LCA, as well as LCI and impact results and related articles. Unit process and characterization data are prepared centrally or by individual LCA practitioners, following essentially the same workflow: from the definition of complete data, through data formatting and submission, including supplementation (the use of additional, smaller data and modeling) or the identification of missing data. However, this open and collaborative model for data production creates new challenges in data integration and harmonization.
As shown in figure 1, BD make their way into each repository along three paths. In the first path, raw sampled BD enter a repository when data sets are parameterized, so that the raw sampled data and the formulas using them appear within the data sets rather than in supporting documentation, as described by Cooper and colleagues (2012). In the second path, a practitioner's submission uses BD in the development of unit process and characterization data, with the data and calculations described in the supporting documentation (i.e., BD are aggregated in a data set). In the third path, a practitioner uses BD in the development of unit process and characterization data as described in the supporting documentation but submits only the life cycle results, sometimes because the underlying fee-based data cannot be submitted to an open repository. Each successive path decreases BD transparency; although this applies to both unit process and characterization data, to date no parameterized characterization factor data exist.
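The first path, parameterization, can be sketched as a data set that carries its raw sampled values and formula with it. The field names and values below are invented for illustration and do not follow any actual repository schema:

```python
from statistics import mean

# A minimal sketch of a parameterized data set (field names invented; not an
# actual repository schema): the raw sampled values and the formula travel
# inside the data set, rather than being summarized in documentation.
dataset = {
    "process": "example crop production unit process",
    "parameters": {
        # Raw sampled values kept in the data set for transparency.
        "yield_kg_per_ha": [5200.0, 4800.0, 5100.0],
        "n_applied_kg_per_ha": 140.0,
    },
    # The formula is stored with the data, so users can re-evaluate it
    # under alternative conditions (e.g., a substituted yield sample).
    "formula": lambda p: p["n_applied_kg_per_ha"] / mean(p["yield_kg_per_ha"]),
}

# Evaluate the embedded formula: kg N applied per kg of crop produced.
n_per_kg = dataset["formula"](dataset["parameters"])
```

Because the raw samples and the formula travel together, a user can substitute alternative conditions and re-evaluate the data set, which is the transparency advantage the first path offers over aggregated submissions.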
Figure 1 was inspired by Reichman and colleagues' (2011) description of the Data Observation Network for Earth7 (DataONE), which provides a repository/cyberinfrastructure to federate currently independent ecology data networks as the foundation of innovative new environmental science. Reichman and colleagues describe a number of challenges (resulting from data dispersion, heterogeneity, provenance/reproducibility, and inadequate rewards for sharing data) that relate to BD and other data in LCA repositories. Data dispersion and heterogeneity are found to be partially addressed by large regional and subject-oriented data collections, mirroring the subject-oriented structure of the USDA LCADC for bioproducts but contrasting with the all-industry formats of the NREL LCI DB and the JRC ELCD, as well as databases such as ecoinvent, GaBi, and SimaPro. For heterogeneity specifically (i.e., the use of different terminologies, specialized measurements, and experimental designs), the adoption of common experimental practices and measurement standards and the use of the Semantic Web are enhancing ecological data interoperability. For LCA, the USDA is harmonizing data about products exchanged between unit processes, using a library of data sets available in public and private LCA databases, in coordination with the US EPA harmonization project (Hawkins et al. 2013), which is addressing such issues for impact assessment through semantic mediation.
Reichman and colleagues also note that “provenance is especially important to support scientific results used in policy and management decisions, where field experiments and techniques may not be fully reproducible due to difficulty of replicating environmental conditions.” They cite the progress of computer scientists in developing ways to capture provenance information (e.g., through scripted analysis systems such as R and scientific workflow systems such as Kepler and Taverna). For LCA, two efforts appear to be contributing. First, data set parameterization explicitly presents sampled raw BD and formulas within data sets, allowing a high level of transparency and modification for alternative conditions. Second, Product Category Rules provide an expert-elicited definition of LCA completeness (see figure 1), which establishes a common, product-specific workflow end point.
Finally, Reichman and colleagues' concern with inadequate rewards for sharing data (promoting acknowledgment) seems, at some level, to have been overcome: to date, thousands of researchers have contributed to DataONE through its network of working groups and member nodes.8 However, Hampton and colleagues (2013) note that many ecologists simply fail to contribute because of the lack of a culture of data curation and sharing, and they state that single-investigator projects “tend to have higher levels of direct investigator involvement in the data collection, as compared with the automation required for big-science projects.” They conclude that a cultural shift has occurred in the sharing of genetic data because journals require that submitting authors provide accession numbers for GenBank or TreeBASE. Likening all of this to the development of data by individual LCA practitioners (figure 1, bottom), we note that the U.S. government requires that its data be “open and machine readable” under Executive Order 13642, and that funding organizations, such as the National Institute of Food and Agriculture, are requiring that some investigators submit data to the USDA LCADC. However, journals publishing LCA results currently have no requirements related to repository submissions.
Thus, BD have been discovered, accessed, sampled, and used in the development of LCAs, and sampled data are explicitly and implicitly disseminated in open-access LCA data repositories. Consistent representation of sampling error is important and should be aligned with efforts to increase the geographic specificity of data. Ideally, the future need for supplementation in key areas would be reduced if, as a research community, we can determine what those key areas are (i.e., through a BD gap analysis by product category). It seems there will always be some need to supplement data within LCA as important and transformative technologies emerge. It is conceivable, given advances in data management and analysis, that BD might one day be accessed live within cyberinfrastructures, and consistently among unit processes and characterizations, within a single LCA or across multiple LCAs.