A general data model for socioeconomic metabolism and its implementation in an industrial ecology data commons prototype

To date, data in industrial ecology (IE) have commonly been seen as existing within the domain of particular methods or models, such as input–output, life cycle assessment, urban metabolism, or material flow analysis data. This artificial division of data by method contradicts the common phenomena described by those data: the objects and processes in the industrial system, or socioeconomic metabolism (SEM). A consequence of this scattered organization of related data across methods is that IE researchers and consultants spend too much time searching for and reformatting data from diverse and incoherent sources, time that could instead be invested in quality control and analysis of model results. This article outlines a solution to two major barriers to data exchange within IE: (a) the lack of a generic structure for IE data and (b) the lack of a bespoke platform to exchange IE datasets. We present a general data model for SEM that can be used to structure all data that can be located in the industrial system, including process descriptions, product descriptions, stocks, flows, and coefficients of all kinds. We describe a relational database built on the general data model and a user interface to it, both of which are open source and can be implemented by individual researchers, groups, institutions, or the entire community. In the latter case, one could speak of an IE data commons (IEDC), and we unveil an IEDC prototype containing a diverse set of datasets from the literature.

The same holds for data integration into so-called hybrid models that integrate monetary and physical process and supply chain information (Crawford, Bontinck, Stephan, Wiedmann, & Yu, 2018; Hawkins, Hendrickson, Higgins, Matthews, & Suh, 2007). Still, there is currently no comprehensive inventory of data describing socioeconomic metabolism (SEM) at different regional, temporal, process, material, and commodity scales and layers such as mass, energy, monetary, or social indicators. Barriers to data integration are both social and technical (Hertwich et al., 2018):
• Lack of incentives, in particular, funding and data management schemes: When research time is scarce and data management competence is low, inappropriate data archiving and transfer of data into other projects may be the consequence. Research funders are aware of this problem and have started requiring data management plans and open access to data. Scholars from the community have issued a call for greater data transparency in IE research (Hertwich et al., 2018). Still, it is too easy to conduct and publish research that contributes to the knowledge base but not to the cumulative database of the field.
• Lack of cross-method data formats and platforms for exchanging IE data: A general data format for the industrial system does not exist. Although generic data sharing platforms such as Figshare (https://figshare.com/) or Zenodo (https://zenodo.org/) have been available for several years and are open and easy to use, they do not prescribe any particular data model or format and have therefore not noticeably alleviated the data integration and exchange problem. (Here, we think of a data model as a scheme that organizes elements of data and specifies how they relate to one another and to real-world entities; cf. ecospold v2 for an example from our community [Meinshausen, Müller-Beilschmidt, & Viere, 2016].)
• A perceived lack of appreciation for sharing data, and the fear of losing competitiveness: Restrictive data sharing policies turn scientific data, a common good, into a local asset. They create local advantages for research groups in a competitive environment. The problem of scientific data turned into local assets is exacerbated by the prevalence of consulting-style research in our field. The longer-term benefits of data exchange between individual researchers and the research community may not be appreciated enough.
The different IE research branches have found ways to deal with these barriers, and a number of data integration schemes are available.

Currently available schemes and services for data integration
Highly integrated MRIO (Tukker & Dietzenbacher, 2013) and life cycle databases (Wernet et al., 2016) form the informational backbone for entire scientific communities. These databases, however, include only a fraction of the available data on the industrial system needed for the different branches of IE research, and the IE community as a whole is not backed by a comprehensive database. In material and energy flow analysis (MEFA) in particular, no common data format and only first attempts toward database development exist, focusing on particular software (www.stan2web.net/), data types such as product lifetimes (Murakami et al., 2010), or system scopes such as cities or national economies (Ravalde & Keirstead, 2015; Schandl et al., 2017; https://metabolismofcities.org/). That infrastructure gap leads to inefficiencies across the entire field because descriptive MEFA forms the basis of process inventorying and other data that feed into both process databases and physical IO tables.
Data pooling services such as Figshare or Zenodo are available for publishing and sharing datasets with low thresholds and thus represent the first and most basic step any researcher should take toward sharing their data with minimal effort. But they have not alleviated the underlying problem, as these platforms recommend but do not require any structuring of the data, which are often hidden in PDFs or images instead. Nor do these services apply an IE-specific data model needed to provide sufficient metadata and systems context to facilitate the transfer of data into other projects.
Although the barriers listed above hinder the free exchange of data, many datasets are available in the literature as well as in different institutional reports. Many government-funded IE-related research projects include some type of dataset inventorying as an objective or initial work package (e.g., http://www.prosumproject.eu/objectives, http://www.mica-project.eu/, or http://www.minea-network.eu/). Still, many published datasets come in custom formatting, often in the form of PDFs, figures, or other documents that do not contain structured data. They are neither inventoried nor easily accessible by standard modeling tools.

Goal and scope of this work
The goal of this article is to outline a technical solution to the two major barriers to data exchange listed above: the lack of a generic data model for IE data and the lack of a bespoke platform to exchange IE datasets. The general data model for SEM presented below can be used to structure a wide spectrum of data types describing objects (substances, materials, goods, products, or commodities) and processes (industrial transformation, storage, distribution, or consumption) in the industrial system, including process descriptions, product descriptions, stocks, flows, and coefficients of all kinds. We describe a relational database built on the general data model and a user interface to it, both of which can be implemented by individual researchers, groups, institutions, or the entire community. In the latter case, one could speak of an industrial ecology data commons (IEDC), and we unveil a prototype for the IEDC that contains a diverse set of published datasets from the literature. The goal of the IEDC is not to encapsulate all IE data and thus replace existing databases, but to offer a data model and a platform for exchanging the many datasets of various types that are scattered across the literature. Future applications in the community and beyond are discussed at the end.

A GENERAL DATA MODEL FOR SEM
A candidate for a "general data model" for SEM must be able to represent the wide spectrum of data describing SEM and its links to the natural environment and to human agents. This spectrum includes: material and product stocks, energy and commodity flows, process yield factors and environmental extensions (emissions and resource use of processes), product lifetimes and material composition, and a large spectrum of performance and impact indicators.

Data in a systems context
Quantitative information on processes, stocks, flows, etc. in SEM has three components: a numerical value, uncertainty information, and the location of the data point in the system, which is given by its data aspects.

FIGURE 1 Selected major data types of socioeconomic metabolism, their aspects, and respective system dimensions (in brackets). One system dimension, for example, time, can link to different aspects, for example, historic time, scenario time, or age-cohort, in the description of the different data types. The system definition shows two processes, p1 and p2, in two different regions ra and rb, each containing a stock. The asterisk (*) indicates the optional aspects of the different data types. A list of all identified aspects is supplied in Figure S1.

Formal definition of the data model
The location of each data point (a number quantifying a fact in a system) in the system can be written as a tuple:

data_item(aspect 1, aspect 2, aspect 3, …) = value
trade_flow(cars, Japan, USA, 2016, number of units) = 2,540,000 units    (1)

Here, a tuple refers to a finite ordered sequence of elements (here: aspects), commonly noted in parentheses (…). The tuple notation of system variables and parameters is commonly applied in IE and other branches of systems analysis, and, notably, for the description of IO contingency tables (Geschke, Lenzen, Kanemoto, & Moran, 2011). In a tuple-based data model, each data point is represented as a value assigned to a tuple of data aspects in a discrete multidimensional space. The dimensions of that space represent the different system dimensions (time, region, product, etc.), which again are connected to the aspects of the data via the aspect-dimension link (Table 1). The following propositions form the basis of the data model:

Proposition 1. Each data point (numerical value quantifying a fact in a system) requires a certain number of aspects that locate it in the system dimensions.
Only when located in the system's dimensions does a data point have a clear meaning. The next step is to classify data points into types:

Proposition 2. Each data type (family of data describing a certain phenomenon in the system, like stock, flow, material content, product lifetime, etc.) has a specific data model that (a) prescribes which aspects are required and which are optional for the meaningful location of this data type in the system definition, and (b) defines the meaning of each aspect.
We then define the dataset, which comprises a set of data points of a specific data type that are organized together in a certain context, for example, a figure, table, or model. Examples of datasets are the greenhouse gas emissions of different countries during a given time period (data type: flow) or the quantities of different materials contained in a certain passenger vehicle type (data type: material composition).

Proposition 3. A sociometabolic dataset D is a function (or mapping) from a set of aspect domains A1, A2, …, An into the real numbers, augmented by null for "no data." The data type of D prescribes which aspect domains are present, and the elements in the different aspect domains are the values that the different data aspects can take.
Each tuple of aspect values locates a data point in the system definition. To describe data uncertainty, the individual tuples can be annotated by supplying parameters of probability distributions (for aleatoric uncertainty) and min/max extremes or high-medium-low alternatives to capture epistemic uncertainty. The list of domains A1, A2, …, An and their meaning are defined by the data type of D (cf. Figure 1). An aspect domain set contains all possible values for the system dimension describing this aspect, and these values are often referred to as classifications (of alloys, products, or industrial sectors).

TABLE 2 (abridged) The seven general data categories and common data types identified under the data model for socioeconomic metabolism, covering the objects of interest (materials, goods, products, substances, commodities, waste, etc.) and the processes acting on them. Example data types include the process inventory and births/deaths, stocks, the output or production capacity of a process under extensive process properties (category 5), and correspondence tables between classifications under correspondence (category 7). Note: The list of data types in the table is not exhaustive. All phenomena or properties described in the system are modeled as one of the defined data types. For example, objects of interest appear as stocks or flows, and their intensive properties are modeled as material composition, specific energy consumption, etc.

We generalize this term:
Proposition 4. The aspect domain set used to define a dataset is called a classification of the corresponding dimension of this aspect, and the elements in that set, the aspect values, are the classification items.
When handling datasets, not only their data model but also the classifications used for their aspects must be specified. For example, the stock and flow data types shown in Figure 1 both have a good/substance index/aspect. Aspect domain sets can use custom classifications (custom material 1, custom material 2, custom material 3, etc.) or standardized values (year 2000, year 2001, … or the general product, industry, and substance classifications). Classifications refer to a specific system dimension, so that different aspects for that dimension can use the same classification.
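The aspect–dimension–classification link can be sketched as follows; all names and items are illustrative. Two aspects of a hypothetical trade-flow data type, origin and destination, refer to the same "region" dimension and therefore share one regional classification, while the material aspect uses a custom classification:

```python
# One classification per named list of items; all entries are illustrative.
classifications = {
    "ISO_regions": ["CHN", "JPN", "USA"],     # standardized region codes
    "historic_years": [2015, 2016, 2017],     # standardized time items
    "my_materials": ["custom material 1",     # a custom classification
                     "custom material 2"],
}

# Aspects of a hypothetical trade-flow data type: each aspect names its
# system dimension and the classification used for it.
trade_flow_aspects = {
    "origin_region":      {"dimension": "region",   "classification": "ISO_regions"},
    "destination_region": {"dimension": "region",   "classification": "ISO_regions"},
    "time":               {"dimension": "time",     "classification": "historic_years"},
    "material":           {"dimension": "material", "classification": "my_materials"},
}
```

Because both region aspects point at the same classification of the same dimension, datasets using them can later be linked exactly rather than by keyword.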
Finally, we define the data group, which allows us to bundle datasets, for example, the different stocks and flows in a material cycle description or the flows, production volumes, etc. that make up a unit process inventory:

Proposition 5. A data group is a collection of different datasets from a common source or research project.
Examples and more explanations are provided in the Supporting Information available on the Journal's website.

Taxonomy of data types for describing SEM
The basic system structure for SEM consists of processes with stocks and flows between processes (Pauliuk et al., 2016). The two broad categories of description are "objects," such as goods, products, materials, or substances, and their storage, distribution, and transformation in "processes." The objects are described using "properties," which can be intrinsic, that is, independent of system location (e.g., the chemical formula of carbon dioxide), or extrinsic, that is, dependent on system location (e.g., the magnitude of a trade flow depends on the countries it links).
In order to locate an object within the system context, only extrinsic properties must be specified. Further, properties can be intensive (independent of the amount of objects in the system) or extensive (additive for different objects) (Cohen et al., 2008; Pauliuk et al., 2016). The division of system elements into objects and processes, and of their extrinsic properties into extensive and intensive, leads to four general data categories (Table 2), of which one (extensive object properties) can be divided further into stocks and flows, the two basic appearance modes of objects in a system (Pauliuk et al., 2016).
Data for individual objects or groups of objects, such as product lifetime or product material composition, describe the intensive object properties. The analogous group for processes contains, for example, the yield ratios of manufacturing processes, the greenhouse gas emissions per unit of output, or the operating costs per unit of output. Moreover, data of categories 1–5 can be used to define ratios, like GDP per capita or per capita building stock, which together form a sixth data category, "general ratios." Finally, a seventh data category was created to store correspondences between different classifications. Although the list of seven data categories is a core part of the general data model, the definition of specific data types under these categories is the result of consensus-building among data providers.

Examples of data types and their aspects in the data model
To show how the data model applies to different data types, we list a proposal for the required and optional aspects of a number of frequently used data types and provide a definition for each type, which describes how the numerical information (value/uncertainty) for a given layer (mass, volume, energy, or monetary) relates to the different aspects of the dataset. Aspects marked with (*) are optional.
• Flows (category 1): A flow is an extensive system variable describing the relocation of material in the system. A list of all hitherto defined data types can be found in the Supporting Information on the web and in the IEDC GitHub repository.

Structure and resolution of different data types
The explicit listing of the core and optional aspects for each data type, introduced by the data model presented here, allows scholars from different modeling communities to enter a dialogue and reach consensus about the aspect structure and the semantics of the different data types that are used across methods. The data model imposes a common structure for the different data types. But it remains flexible regarding system scope and resolution, the choice of which remains research question driven, project specific, and at the discretion of the data gatherers. For each dataset, the classification items of the different aspects can follow an established global classification, such as ISO regional codes, use a project-wide classification such as the ecoinvent 3.4 activity list (https://www.ecoinvent.org/), or can be given in a custom classification that applies to a given dataset only. Intrinsic information about the different aspect items, that is, information that is independent of the system context (chemical formula of CO 2 , region ISO codes, etc.), can be recorded as separate attributes along with the dimensional classification items.

IMPLEMENTATION OF THE GENERAL DATA MODEL IN THE IE DATA COMMONS
The data model can be implemented in different ways, including spreadsheet-formatted data, relational databases, or array-shaped data in programming environments. Data accessibility and the level of database integration determine the circle of potential users and applications and require proper consideration. Here, we present a proposal for an IEDC based on the general data model.

Choice of database integration level
A number of design choices need to be made when developing a database for SEM; they fall into the following categories:
• Data structure: Extent to which a specific data model is prescribed.
• Data aspect classification: Custom, project specific, discipline specific, or universal.
• Data interoperability: Level of formatting and machine readability.

Different levels of data integration are possible, and the following three broad cases are typical (cf. also Table 3):
• Low level of integration: Data are inventoried by keywords but come in custom aspect classification and file format.
• Intermediate level of integration: Data are cast into data models but not integrated, as the different datasets keep their original classifications. Data are easy to access and can be used for a wide range of research questions. This is the gap the IEDC tries to fill.
• High level of integration: Data, also of different types, are interlinked via a project-wide classification for the different data aspects, which makes the database internally consistent. MRIO tables, life cycle databases, or UN trade statistics are examples of different degrees of data linking. Large datasets are relatively easy to use, but the set of research questions they are applicable to is limited due to the given classification and data scope.
Other integration levels can be thought of, for example, low integration with the additional requirement that all datasets are version managed and machine readable, as with the Dat Project (https://datproject.org/).

Database structure of the IE data commons
For the IEDC prototype, we chose the intermediate level of integration because of the prevailing infrastructure gap. The common data model presented above is used, a mix of general, subject-specific, and custom aspect classifications is allowed for, and all datasets are transformed into a common data format. All data can be linked to general classifications, which again can be linked to the Semantic Web. The tuple-based data model presented above is represented as an OLAP cube, where OLAP stands for "online analytical processing." An OLAP cube is a multidimensional array representation of tuple-based datasets, also applicable to IE data (Gray, Bosworth, Layman, & Pirahesh, 1996; Lupton & Allwood, 2017). In applications and databases, OLAP cubes can be implemented in different ways, for example, as multidimensional arrays, 2D tables with multi-indices, or lists of tuples.
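The equivalence between two of these representations, a list of tuples and a dense multidimensional array, can be sketched in a few lines of Python (flow values and classification items are invented for illustration; tuples absent from the list become "no data" entries in the cube):

```python
# A small flow dataset in list-of-tuples form: (region, year, value).
# All values are illustrative.
records = [
    ("DEU", 2015, 42.0),
    ("DEU", 2016, 44.5),
    ("JPN", 2015, 31.2),
]

# Index maps from classification items to array positions.
regions = sorted({r for r, _, _ in records})
years = sorted({y for _, y, _ in records})

# Densify into a two-dimensional "cube"; missing tuples stay None (no data).
cube = [[None] * len(years) for _ in regions]
for region, year, value in records:
    cube[regions.index(region)][years.index(year)] = value
```

The same data can thus be stored as tuples in a relational table and reshaped into an array only when a model needs it.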
We chose a relational database model for implementing the OLAP cubes, as it is well established and easy to set up and therefore suitable for a prototype. A simplified representation of the relational database structure is shown in Figure 2. The database core comprises six tables: "data" for storing data points, "datasets" for storing dataset aspect structure and metadata, the "projects" and "datagroups" tables to group datasets, and the "classification_definitions" and "classification_items" tables to define and describe the dimensional classifications used. The IEDC does not contain one but many different datasets.
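A minimal sketch of these six core tables, using Python's built-in SQLite module for self-containedness rather than the MySQL server of the prototype; the column names are illustrative assumptions, not the actual IEDC schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (
    id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE datagroups (
    id INTEGER PRIMARY KEY, name TEXT,
    project_id INTEGER REFERENCES projects(id));
CREATE TABLE classification_definitions (
    id INTEGER PRIMARY KEY, name TEXT, dimension TEXT);
CREATE TABLE classification_items (
    id INTEGER PRIMARY KEY, item TEXT,
    classification_id INTEGER REFERENCES classification_definitions(id));
CREATE TABLE datasets (                      -- aspect structure and metadata
    id INTEGER PRIMARY KEY, name TEXT, version TEXT, data_type TEXT,
    datagroup_id INTEGER REFERENCES datagroups(id),
    UNIQUE (name, version));
CREATE TABLE data (                          -- one row per data point
    id INTEGER PRIMARY KEY,
    dataset_id INTEGER REFERENCES datasets(id),
    aspect_1_item INTEGER REFERENCES classification_items(id),
    aspect_2_item INTEGER REFERENCES classification_items(id),
    value REAL);
""")
```

Each data point row links to its dataset and, via the classification item tables, to the classification of each aspect, which is what allows many differently classified datasets to coexist in one database.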

Implementation of the prototype
For reasons of convenience, a MySQL relational database server was chosen because it is easy to set up and maintain, also by non-experts, and graphical user interfaces are available from off-the-shelf libraries. An Excel template featuring both table- and list-shaped data and a Python import library (https://github.com/IndEcol/IEDC_tools) are used to transfer template data into the database. Data providers need to apply basic formatting to incoming data and make an effort to use already existing classifications to label the different data aspects. The data importer then automatically links the supplied classification items of the different data aspects to the already defined classifications. If a custom classification is used for a certain aspect, this is indicated in the dataset, and a new classification is created and used during the upload routine. Details are provided in the Supporting Information on the web. For nontemplate datasets, custom scripts can be used as well. Uncertainty information for random and systematic errors can be provided for each data point. There are options for entering the parameters of commonly used probability distributions as well as low-mean-high alternatives or standard deviations.
The data importer has the following functional specification: (a) parse data template and read all metadata and data and make sure that all are formatted correctly; (b) check whether all metadata (user, data type, and data category) exist in the database so that the new data can be linked to them; (c) check whether classification items for already registered classifications exist so that the new data can be linked to them; (d) create custom classifications from classification items gathered from data template so that the new data can be linked to them; and (e) link new data to classifications and lookup tables and insert them into database.
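Steps (a)–(e) can be sketched as follows; this is a deliberately simplified stand-in for the IEDC_tools importer, and all function names, dictionary keys, and the in-memory "database" are hypothetical:

```python
def import_template(template, db):
    # (a) parse the template: metadata and the list of data points
    meta, points = template["metadata"], template["data"]
    # (b) referenced metadata must already exist in the database
    for key in ("user", "data_type", "data_category"):
        if meta[key] not in db[key]:
            raise ValueError(f"unknown {key}: {meta[key]!r}")
    # (c)/(d) for each aspect, link items to a registered classification,
    # or create a new custom classification from the supplied items
    for aspect, cname in meta["aspects"].items():
        items = {p[aspect] for p in points}
        if cname == "custom":
            db["classifications"][meta["name"] + "_" + aspect] = items
        elif not items <= db["classifications"][cname]:
            raise ValueError(f"unregistered items for aspect {aspect!r}")
    # (e) all checks passed: insert the linked data points
    db["data"].extend(points)
```

A failed check aborts the upload before any data point is inserted, mirroring the validation-first order of the functional specification.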
For quality control purposes, data are first uploaded to a review database, which has the same structure as the main database. Only once the data provider has confirmed that the data were uploaded correctly is the dataset moved to the main database. Figure 3 shows the data flow for uploading, reviewing, and retrieving data. Although the web interface can be used to search for, select, and download individual datasets in different formats, bulk upload and download as well as custom querying of the database are available by means of SQL queries. Built-in or external object-relational mapping packages (ORMs) facilitate direct access from, for example, Python (pymysql) or MATLAB.
The IEDC prototype presented here allows scholars to record the system location and the meaning of a wide array of sociometabolic data in a systems context. The data model does not record and cannot resolve conflicts between different datasets that result from classification conflicts or from unclear or conflicting system boundary settings. Resolving such conflicts requires additional work, and the resulting refined datasets can again be distributed via the IEDC. To prepare the data for future use in other projects, data suppliers should make an effort to document background information, especially where possible data interpretation conflicts are known to arise, for example, for ore extraction (before or after concentration?) or mass and energy layers (dry or wet mass? Lower or higher heating value?).

FIGURE 3 Proposed industrial ecology data commons data flow for uploading, retrieving, and downloading data

The prototype of the industrial ecology data commons, containing more than 800,000 data points in 128 datasets on February 15, 2019, is accessible via www.database.industrialecology.uni-freiburg.de.

DATA INTEGRATION FOR IE RESEARCH: THE WAY FORWARD
The data model and data commons prototype presented here offer an immediate benefit to the community by providing a platform for direct download of available datasets and by offering a method for structuring datasets that currently hibernate in PDFs, spreadsheets, and other documents.
Below, we present our initial thoughts on the wider implications of the data model.

Linking data across IE and beyond
Translation schemes to convert established data models to the comprehensive one presented here are already available on the IEDC GitHub repository, including the ecospold activity data model (Meinshausen et al., 2016), the general knowledge model for LCA (Kuczenski, 2018; Kuczenski, Davis, Rivela, & Janowicz, 2016), and the tabular representations of MRIO data, for which a tuple-based data model already exists (Geschke et al., 2011). Translations from other available formats, such as the Sankey diagram (Schmidt, 2008) data format (Lupton & Allwood, 2017), the unified materials information system, the data format of the STAN software for MFA (Brunner & Rechberger, 2016), a recently published data characterization framework for MFA (Schwab, Zoboli, & Rechberger, 2016), and the databases for urban metabolism (https://metabolismofcities.org/; Ravalde & Keirstead, 2015), would represent useful future additions.
Seeing the formats of established databases such as ecoinvent datasets and MRIO tables as different representations of a common underlying structure helps to understand and compare the different data models used, which is crucial when combining databases in a more automated manner than is done today. Complex datasets, such as ecospold unit process descriptions or MRIO tables, can be broken down into the more basic data types of the IEDC. For example, an IO table consists of different flow and coefficient tables (interindustry, industry to environment, industry to final consumers, etc.), and an LCA unit process inventory (ecospold) consists of normalized flow data, production volumes, and intensive data such as water content and price information. All these individual datasets fit into the general data model for SEM presented here, and their combination into a subject-specific database can be described as a data group.
Once the data have been recast into a general structure, they can be linked to each other to speed up the extraction and analysis of datasets describing entire systems, for example, to quickly find all data related to "steel production" in "China" in "2015," no matter which database, method, or subfield they come from. In a first step, datasets can be linked by supplying common keywords. Exact (and machine-operable) links, however, can only be established by using common classifications for the different data aspects across datasets.
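This first, keyword-based linking step amounts to a simple subset query over a dataset inventory; a minimal sketch with invented dataset names and keywords:

```python
# An inventory of datasets tagged with keywords, regardless of data type
# or method of origin; all entries are illustrative.
inventory = [
    {"name": "steel_flows_cn", "type": "flow",
     "keywords": {"steel production", "China", "2015"}},
    {"name": "steel_stock_cn", "type": "stock",
     "keywords": {"steel production", "China", "2015", "buildings"}},
    {"name": "al_flows_jp", "type": "flow",
     "keywords": {"aluminium", "Japan", "2015"}},
]

def find_datasets(query):
    """Return the names of all datasets whose keyword set contains the query."""
    return [d["name"] for d in inventory if query <= d["keywords"]]
```

Exact, machine-operable linking would replace the free-text keywords with items from shared classifications, but the query pattern stays the same.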
To improve the usefulness of the data outside IE and link them to other information, explicit links to other data realms need to be created, for example, by linking classification items to the Semantic Web (cf. Davis, Nikolic, & Dijkema, 2010; Ghali & Frayret, 2018). If data providers use URIs as well, automatic data updates could become a standard procedure in the future. To further facilitate machine processing of data, the linked IEDC data could be broken down into "subject-predicate-object" triples, commonly known as Resource Description Framework (RDF) data (Schreiber & Raimond, 2014). A pilot project on using RDF for storing IE data conducted by the main author found that the toolchain for storing and querying triple databases was much harder to implement and operate than off-the-shelf relational database servers, so that the latter remain the more practical option given the presently available resources.

Expansion and limits of the data model
The tuple notation of data is very general and flexible due to the arbitrary number of dimensions, aspects, and classifications allowed. Footprints and criticality metrics, for example, are system-context indicators that cannot be measured directly and that are qualitatively different from process indicators and intensive object properties like material content. Nevertheless, they can be recorded under the data model, simply by defining a "footprint" or "criticality" aspect and recording the corresponding data for different regional, temporal, and system scopes.
MEFA is often used to describe processes in the environment (e.g., nutrient balances in the soil, in lakes, or forests) and the data model presented here can be applied to nonindustrial systems as well, as long as suitable system dimensions and data aspects are chosen.
The data model and the IEDC were created out of the necessity to describe the system in discrete regions, processes, and commodity groups. For geographic data, such as those stored in shapefiles, and other data that describe a continuum, such as high-resolution time series or satellite images, the data model still applies as long as these data can be placed in a system definition of the type described by Pauliuk et al. (2016). The database structure presented here cannot, however, efficiently accommodate very large datasets, such as high-resolution data.
The data model allows for the recording of basic metadata at the data item and dataset levels. Metadata specification varies substantially across fields (e.g., compare available ecospold metadata with available MRIO metadata), and future work needs to identify how these rich but differently formatted metadata can be brought into a general structure.

Building the foundation of community data infrastructure
The IEDC prototype is a demonstrator for how data infrastructure in the IE community could function at intermediate levels of integration. Next to a common underlying data structure, a main barrier to higher levels of data integration is that universal classifications for processes, objects, and the other system dimensions are still missing.

Review, version control, and traceability of data
The IEDC prototype stores data that are published elsewhere, so these data have already passed quality control such as peer review; the main focus of the review process when uploading to the IEDC therefore lies on the correct application of the chosen data model and the correct transfer of numbers. This formatting process and the subsequent use of data often lead to insights that warrant the publishing of a revised dataset version, and the IEDC data structure facilitates the recording of different versions of a dataset due to the UNIQUE constraint on the (dataset_name; dataset_version) tuple in the dataset table. This setup does not, however, allow for the versioning of individual data items within a dataset. Here, the link between data version management tools (e.g., via https://datproject.org/) and a static database needs to be explored. The IEDC prototype offers upload and download options from and to Excel templates (containing the dataset description and the data items) and from and to custom formats via scripts based on SQL queries. More versatile formats need to be developed, such as JSON, RDF, or CSV files, shared and version-managed as so-called data packages (https://frictionlessdata.io/data-packages). For the time being, documenting the process of converting available data from the different sources, accounting routines, or model calculations into the IEDC template is the responsibility of the data provider, and no guidelines or standards exist. A convenient way of tracing this process is to use additional sheets in the Excel template. For data that are only available in a nonportable format, such as tables or figures in image or PDF files, we encourage the use of the liberated_data project (https://github.com/nheeren/liberated_data) to increase reproducibility and transparency in the data extraction process.
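The effect of the UNIQUE constraint on dataset versioning can be illustrated with a self-contained sketch (SQLite instead of the prototype's MySQL; the simplified table omits all other columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE datasets (
    id INTEGER PRIMARY KEY,
    dataset_name TEXT,
    dataset_version TEXT,
    UNIQUE (dataset_name, dataset_version))""")

conn.execute("INSERT INTO datasets (dataset_name, dataset_version) "
             "VALUES ('steel_cycle', 'v1.0')")
# A revised version of the same dataset can be registered alongside v1.0:
conn.execute("INSERT INTO datasets (dataset_name, dataset_version) "
             "VALUES ('steel_cycle', 'v1.1')")
# Re-uploading an existing (name, version) pair is rejected:
try:
    conn.execute("INSERT INTO datasets (dataset_name, dataset_version) "
                 "VALUES ('steel_cycle', 'v1.0')")
except sqlite3.IntegrityError:
    pass  # duplicate (name, version) rejected by the UNIQUE constraint
```

Versioning is thus enforced at the dataset level; individual data items within a dataset share the version of their dataset row.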

Licensing and legal issues
When compiling the data for the prototype, we realized that most published datasets do not specify a license for their use. Some appear to "inherit" the license of the document containing them, such as the supplementary material of a journal publication. The lack of licensing transparency and the barriers to reusing data due to intended or unintended restrictive licensing clearly are problems that need to be solved. The issue has already been taken up by a recent call for more data transparency in IE (Hertwich et al., 2018) and by the open energy system modeling community.

The data model for SEM provides a general structure for many common data types, including stocks, flows, concentrations, lifetimes, and process coefficients. It clarifies the relationship between data, the underlying data types and categories, the data aspects, and the link between data aspects and system dimensions. Well-structured, well-described, and accessible data are key to storing data for later use, both within and across research groups. The data model presented here helps structure and consolidate quantitative systems information and thus makes data easier to archive and reuse without loss of meaning. Moreover, the possibility of interlinking data allows for new ways of IE systems analysis.
A data commons is a major building block for open science in IE. A strong and vibrant research community, fueled by a rich and open database, is attractive to future scholars, visible to other research fields, and has high impact with decision makers and the general public. Its establishment requires commitment by individual researchers to submit data and by the community to build up and maintain the infrastructure. The benefit of such an investment would be immense.