Multisource spatial data integration for use case applications

The reuse and integration of data offer great opportunities, supported by the FAIR data principles. Seamless data integration from heterogeneous sources has long been an interest of the geospatial community. However, 3D city models, building information models, and information supporting smart cities present higher semantic and geometric complexity, which poses new challenges not yet tackled by a comprehensive methodology. Building on previous theories and studies, this article proposes an overarching workflow and framework for multisource (geo)spatial data integration. It starts from the definition of use case-based requirements for the integrated data, guides the analysis of the integrability of the involved datasets, suggests actions to harmonize them, and concludes with data merging and validation. It is finally tested and exemplified in a case study. This approach allows the development of consistent, well-documented, and inclusive data integration workflows, for the sake of use case automation in various geospatial domains and the production of interoperable and reusable data.

The increasing availability of data shared through the web (e.g., open data, linked data, data spaces) has opened new perspectives toward the reuse of existing data for use cases other than the ones for which they were originally collected. Similarly, the Findable, Accessible, Interoperable and Reusable (FAIR) (https://www.howtofair.dk [Accessed 6th June 2022]) data principles support a new sharing economy for data.
However, although many initiatives have been developed toward interoperability, such as standardization actions, ontology-related research (Kavouras & Kokla, 2007), and GeoBIM research (for the integration of geoinformation and Building Information Models, that is, "BIM"), a comprehensive methodology for multisource data integration has not been proposed yet. Mohammadi et al. (2006) identify the aspects to be considered for data integration as institutional, policy, legal, and social, besides technical. This article focuses on the technical side of the integration.
According to Kavouras and Kokla (2007), building an integrated view of heterogeneous systems requires: (1) identifying the heterogeneities; (2) analyzing importance and priorities; and (3) solving them through a systematic strategy. Wiemann and Bernard (2016) define the steps for data integration, which they call "data fusion," as: (1) data search and retrieval; (2) data enhancement; (3) harmonization; (4) relation measurement; (5) feature mapping; (6) resolving; and (7) data provision. They propose an interesting approach based on linked data. Mohammadi et al. (2010) proposed a methodology and a tool to facilitate spatial data integration within spatial data infrastructures, considering both the technical and non-technical issues. However, the increased complexity of the data available nowadays (3D, deep and complex data structures) presents new challenges from the technical point of view that need to be tackled to achieve an effective integration.
Existing efforts often consider mainly the semantic and structural aspects of data integration (e.g., Lenzerini, 2002).
For example, Kavouras and Kokla (2007) outline relevant methodologies for ontologies integration, which are partially reused, adapted, and eventually referred within this article. However, their focus is on semantics and structure, while it is important for this study to consider as well the complexity of geometry as a separate issue. Furthermore, data from practice hardly reach the complexity considered by Kavouras and Kokla (2007).
Other cases report methods to merge the geometric information from multiple sources or sensors ("data fusion") (e.g., Ahn et al., 2020; Dong et al., 2018; Ramos & Remondino, 2015; Zhu & Donia, 2013). Data fusion techniques are developed to integrate (big) data by means of different criteria and methods intended to automate the integration, that is, data-, feature-, and decision-level fusion (Yin et al., 2021; Zhang et al., 2022). More complete database integration was also studied, but still for 2D Geographical Information Systems (GIS) (Devogele et al., 1998; Uitermark et al., 2005).
Standardization efforts have been intended to solve interoperability, and consequently integration, issues.
However, their development is still poorly aligned with their implementation in software and adoption in practice, which makes the original aim of standardization still an open problem (Section 1.1).
In this article, the features of spatial data are analyzed in detail, including the characteristics defining complex 3D information systems, such as BIM and 3D city models. The processing needed to harmonize the data is defined accordingly, and some relevant available methods to perform such processing are mentioned as initial guidance. These phases are embedded in an overall workflow guiding the process from the choice of the input datasets to the final merging and validation of the harmonized data. A critical starting step for a successful integration process is the definition of the requirements for the integrated data to be obtained, so that they support the intended use case applications. A concrete and well-defined scope and use of the data (including software and procedural details) is the preferable way to success and allows validation and testing.
While most existing methodologies focus on a few aspects of data integration, sometimes neglecting the features of data as provided in practice, a pragmatic workflow is proposed in this work to support overall data integration. The adopted approach strongly relates to the definition of requirements for the integrated data, and decomposes the integration into sub-issues to be analyzed and tackled in detail.
The methodology combines methods common in ontology engineering and data fusion, with experiences in 3D information systems integration and GeoBIM, as described in Section 2. The results are explained starting from the proposed integration workflow (Section 3) and the parameters and specific features to be considered in the integration (Section 3.1). After some suggestions for data requirements definition (Section 3.2) and data retrieval (Section 3.3), a rubric to analyze each feature or parameter of the input data, assessing their potential for integration, is proposed in Sections 3.4 and 3.5. In Section 3.6, a range of methods are mentioned to tackle the processing pointed out by such analysis and assessment. Finally, Section 3.7 presents methods for fusing the harmonized data, and validation. The proposed methodology is exemplified in a case study (Section 4). The discussion (Section 5) follows, including the potential automation of the proposed workflow (Section 5.1) and an analysis of metadata standards (Section 5.2) possibly facilitating the preliminary integrability assessment.

| Open standards and related issues
To support interoperability and integration, open standards are published for different domains, for example with the specific scope of representing energy-related features of buildings and constructions, to support analysis.
However, these standards also present disadvantages, such as the significant effort required to produce compliant data or the limited flexibility in some cases (Doerr, 2004). At the same time, they often propose very comprehensive schemas, aiming at covering the entire domain, and leave open the possibility to use the model in very different ways, to adapt to different use cases' needs. Although this makes such standards suitable for a large variety of representations, interoperability and integration processes suffer from it, because data become quite unpredictable, even if compliant with the same standard. It is essential to know how the (often ambiguous) models were interpreted, with respect to structure, semantics, and geometry, and how the structure was used, for example, which of the allowed options is used to store some specific information, such as georeferencing (Clemen & Görne, 2019). In addition, extensions and generic entities can be used to extend the prescribed model further, resulting in an even wider possibility to produce standard-compliant yet conflicting data. Clear definitions and examples are still seldom available to allow a consistent use of the standard data models (including interpretation of classes' meaning, attributes, and relationships), despite the general tendency toward the improvement of such aspects in the current standardization efforts.
Moreover, the use and implementation into concrete tools and data from practice often present issues, and standards are not used consistently enough to provide fully automatically interoperable and integrable data. Figure 1 shows an example of three 3D city models of Rotterdam (NL). One is the 3D city model developed by the City of Rotterdam (https://www.3drotterdam.nl/#/ [Accessed 24th November 2021]), the second was generated by the software 3dfier (Ledoux et al., 2021), and the third one is the Basisregistratie Grootschalige Topografie (BGT), the Dutch national topographic map, and is structured according to the IMGeo (https://www.geonovum.nl/geo-standaarden/bgt-imgeo [Accessed 24th November 2021]) data model, which is modeled as an Application Domain Extension (ADE) of CityGML v.2 (Van den Brink et al., 2013a). All of them are CityGML-compliant. However, they result in quite different models (Colucci et al., 2020).
As a consequence, an in-depth analysis is necessary to consider the relevant parameters and characteristics involved in interoperability, explained in Section 3.1, even if the data are declared standard-compliant.

| Interoperability versus data integration
The two concepts of interoperability and integration are often confused. However, even if strictly related, they have different meanings. Kavouras and Kokla (2007) state that "interoperability is the ability of systems or products to operate effectively and efficiently in conjunction, on the exchange and reuse of available resources, services, procedures, and information, in order to fulfil the requirements of a specific task." They add that "it is not exhausted with integration, but also involves means of intelligent communication such as querying, extraction, transformation etc."

F I G U R E 1 A comparison of entities, attributes (in pink, the attributes used as generics) and the visualization of three different CityGML-compliant models.

Moreover,
interoperability in a broader governance-related domain is defined as "the ability of organisations to interact towards mutually beneficial goals, involving the sharing of information and knowledge between these organisations, through the business processes they support, by means of the exchange of data between their ICT systems" (EU, 2017). In such a context, four interoperability layers are identified: (1) technical; (2) semantic; (3) organizational; and (4) legal interoperability. These are interconnected and include the usual aspects considered for interoperability: technology, regarding information and communication technology systems and software; data; humans, that is, needed skills, know-how, and related general knowledge and practice; and institutional practices, that is, the processes and best practices implemented in everyday life within institutions and practice. The scope of this article covers the aspect of data, in particular in relation to the so-called "semantic" interoperability, but it is strictly related to and influences the technical interoperability of data and the human side of interoperability, concerning data interpretation and description for reuse (gray dotted rectangle in Figure 2).
Interoperability can be considered as a characteristic of single datasets, allowing their reusability across systems (e.g., their potential for being consistently imported and exported by software). Integration is instead the combination or conflation of information from different datasets (Worboys & Duckham, 2004). Figure 3 depicts what the two concepts entail and how they are related to each other.
This study is particularly focused on data integration.
Moreover, the relevant set of data parameters and aspects to be considered for the integration is made explicit and organized into a framework. It is intended to provide a reference for assessing the integration potential of datasets with respect to the destination data requirements, as well as to guide the harmonization.
Processing methods are then suggested for each case, to transform and convert the input datasets as necessary to harmonize them with the data requirements. Potentially usable methods are reviewed.
The methods to solve each of the steps in the integration workflow are many and need to be chosen according to the kind of processing needed and the kind of data involved. Therefore, in this article, it is not possible to give one recipe to fit all cases, but several methods from the literature are proposed in order to overcome the most common issues.
The proposed framework is intended to point out the needs and guide the process.
This framework is finally validated and tested in a case study regarding the update of a 3D city model by means of the integration of a BIM model of a newly designed building. The focus of the experiment is not on the processing itself; therefore, many steps were performed manually or by means of existing tools. Other similar cases were proposed, for example, by Noardo et al. (2016) and van Heerden (2021).

| THE PROPOSED WORKFLOW AND FRAMEWORK FOR MULTISOURCE DATA INTEGRATION
A workflow for an effective data integration methodology is depicted in Figure 4. It covers the process from the starting phase, that is, the definition of requirements, to the final phase, the update of metadata after data merging. Moreover, the several issues involved in data integration are considered, including legal constraints, harmonization, data merging, and validation. Available examples in the literature usually focus on a part of this workflow, without considering the other aspects explicitly. Since this article is intended as a general framework to guide the integration concretely, all the steps need to be considered consistently.
The integration effort for a specific use case is considered. Therefore, the essential starting point, critical in this methodology, is the definition of the requirements for the data to be obtained after the integration (Section 3.2).
The parameters to be defined in the data requirements definition, as well as in the following phases, are listed in Section 3.1. In addition, the non-technical features, as defined in Mohammadi et al. (2006), must be planned.
Based on such defined requirements, the input datasets must be retrieved in order to cover the defined need for information (Section 3.3). It is necessary to double-check that the legal properties of the input data do not conflict with those of the integrated data (e.g., copyright, privacy, licenses, and so on). In case they conflict, it should be assessed whether the input data can be transformed to meet the requirements (possibly through data generalization, anonymization, attribute removal, etc.). Otherwise, either the integrated data requirements (i.e., the future data specifications) should be adjusted, where possible, or other data must be retrieved. Then, the assessment of data integrability must be performed, according to the framework proposed in Sections 3.4 and 3.5. From that analysis, it can be assessed whether it is possible to use the selected input data in the integration, and whether the effort required to harmonize them is worthwhile. Otherwise, different data should be selected. If the assessment is positive, the harmonization actions (enrichment, generalization, conversion) must be chosen and applied for each considered aspect (Section 3.6). Data fusion will then allow obtaining the integrated dataset (Section 3.7). The final steps are the validation of such a dataset against the defined data requirements and the update of metadata to keep track of the applied processing.

F I G U R E 2 Interoperability layers from EU (2017), with the aspects of interoperability mapped (data, humans, institution, and technology). The gray dotted rectangle identifies the scope of this article with respect to interoperability.
F I G U R E 3 Data interoperability vs. data integration.

| Relevant data features and parameters
In this section, the data characteristics are explained that should always be explicitly prescribed by data requirements in case of data acquisition, modeling, or harvesting, described in metadata, and considered in data integration efforts (Figure 5).
Resources on spatial data management and integration distinguish between "semantic" level (i.e., difference in conceptualization and definition-including terms used, specific meaning and classifications); "structural" or "schematic" level (i.e., the conceptual model or schema structuring the data-relations between entities and attributes, relationships, and hierarchies); and "syntactical" level (i.e., the format of the data). Semantics and structure are very much interrelated, and one cannot be considered without the other. In fact, they are often considered together simply as "semantics." However, they are treated separately here, according to other authoritative sources in the literature (Doerr, 2004;Kavouras & Kokla, 2007;Worboys & Duckham, 2004).

F I G U R E 4 A workflow for suitable data integration.
F I G U R E 5 Synthesis schema of the data parameters relevant for data integration, as considered in this article.
What is generally missing in the literature dealing with data integration theory is the geometry level. Often, this is treated as part of the semantic content of the data, such as in the ontologies-related field (Kavouras & Kokla, 2007).
In other studies, geometry is the main focus of integration, as in the data fusion and 2D GIS integration literature, but the other aspects are neglected. In this article, the "geometric" level is considered as a separate aspect of the data, since it presents specificities that need to be tackled by means of specific methods. Overlooking the geometry could cause serious issues in data integration, reducing the potential for such data to be used within automatic tools.
Obtaining a consistent integration including geometry, instead, supports a powerful reuse of data for various use cases, such as exploiting BIM or GIS software for (geo)spatial data analysis and smart cities support.
In addition, more general properties of the data must be considered, related to their content and intended use, in order to make a preliminary assessment of their usability for the designed goal. These are the geographical extent, time frame, and scope. The geographical extent is the real location of the objects represented by the data. It is defined by a specific spatial extent located with respect to the Earth's surface, which could be 2D (planar location) or 3D (considering heights and z values). The temporal frame indicates the time period in which data were acquired or updated. The scope of the data is the definition of the part of reality to be represented and the intended use for which the conceptualization was designed. It determines differences with respect to: coverage and detail; granularity (e.g., building and building elements vs. all city objects); classification perspectives and consequent relations (e.g., a building door as part of the internal distribution system of the construction vs. a building door as an address); semantics, due to the kind of partition of reality, that is, the specific intended meaning of a term or concept, with respect to synonymy, homonymy, and different meanings for the same term (e.g., a slab intended as a structural element vs. a slab intended as the whole package dividing the storeys from each other) (Kavouras & Kokla, 2007); and geometry. Therefore, considering the general scope of data in the initial assessment could be meaningful even before analyzing the previously mentioned parameters, which specify the scope in detail (geometry, semantics, structure, and syntax).
Geographical extent, temporal framework, and scope can be documented as objective information. Further, it is relevant to take into account some more qualitative background of the data, which has an impact on modeling and implementation choices, to properly (re)use the data and avoid mistakes in their interpretation.
The original goal of the data (intended use) and the specific use case requirements, for which they are produced, influence the data themselves ("perspective" in Kavouras & Kokla, 2007), affecting the modeling (geometry, semantics, and structure) and storage (syntax, data format).
The lineage of the data determines their final characteristics, which must be known by the data users (accuracy, objects represented, etc.) (e.g., Biljecki et al., 2015; Lunetta et al., 1991; Thapa & Bossler, 1992). It implies the modeling method, including the original sources of information possibly processed for the modeling phase (e.g., previous maps, original survey, acquisition and measurement methods, measure processing, storage methods, point clouds, photogrammetric plotting, etc.), the criteria used, and the choices made for modeling the final data (represented objects, used level of detail, generalization methods, and any other pre- or post-processing). Often, it is sufficient to know the resulting characteristics, but in the most complex cases, documenting the process of production of a dataset can help understand it and use it as properly as possible, avoiding the propagation or generation of errors or misuse and misinterpretation of the data.
Finally, (software-specific) implementation details-the software that will use the data (or for which/with which the data were modeled)-must be known, since they also determine choices in the use, selection, and storage of information within the datasets.
Besides these, Kavouras and Kokla (2007) add, among the causes of taxonomic diversity, discipline (field into which data are designed and generated); ethno-/cultural-/socio-based view (nuances in the concept meaning and interpretation of a domain by different cultures or societies, as well as the local geographical terms used); human cognitive diversity (different individuals perceive and conceptualize a domain differently). Therefore, also knowing the authorship of data and the context within which they were generated can help in manipulating them correctly.
Such diversities are not only relevant for taxonomy, but for all of the choices made when producing the data, therefore also affecting the geometric modeling, the format chosen, and so on. Integration must be aware of all the diversities involved.
Metadata should describe as many as possible of the listed parameters, which are critical to assess the data correctly and speed up the integration, guaranteeing a higher quality of data resulting from the integration. Table 1 summarizes the geometric, semantic, structural, and syntactical parameters involved in data definition relevant for integration, which are explained in detail in Appendix A.
3.1.1 | A reflection on (Open) standard-compliant data

Structure, as well as semantics and geometry, can be declared compliant with (Open) standard data models. However, this does not per se solve the harmonization-related issues, since many data from practice can implement or interpret the standard data models differently (Section 1.1). In addition, a few variables must be taken into account. A first one is the version of the reference standard: for example, CityGML version 3 introduces new modules, such as "Construction," to describe construction details, as well as other modules related to the performance of the model, like "Versioning" and "Dynamizer." The concept of space was mainly modeled by the "Room" class in CityGML 2, while a different and more detailed conceptualization is foreseen in version 3 (AbstractLogicalSpace, AbstractPhysicalSpace, occupied or unoccupied space, and so on). Other changes are made in the core module, introducing the concept of "FeatureType" (i.e., abstractions of real-world phenomena that have an identity) disjoint from "ObjectType" (i.e., objects that have an identity but are not features). This makes the general conceptualization quite different from the previous version, and compliant models can therefore differ over different versions of the same standard accordingly.
Another example, among many others, is the storage of georeferencing information in IFC v.2x3 with respect to v.4, where the storage of more complete information is enabled (Clemen & Görne, 2019).
Besides that, often not the entire standard data model is used in datasets, but only a profile, that is, a part of the entire model, according to the needs of the applications. It represents the actual data model used by the data, and it is therefore relevant to outline it explicitly. The description of the specific interpretation and use of the model is very relevant to enable consistent use and integration.
Similarly, extensions of the standard data models are used to enhance their representation scope for specific applications, by means of foreseen mechanisms, such as the Application Domain Extension (ADE) in CityGML (Van den Brink et al., 2013b). In case official extensions are used, they can be considered similar to a reference data model themselves; therefore, the version and used profile must be verified and compared as well.
Moreover, generics (classes, attributes, and relationships) are foreseen in the standard data models (the "generics" module in CityGML, ifcProxies in IFC). They provide a structure for objects not covered by any other class, attribute, or relationship in the standard conceptual model. Although the recommendation is to use them only if an appropriate structure is not provided by the remainder of the schemas, in data from practice, they are very often used in place of existing entities. Therefore, even if standard-compliant, many data follow a customized data model (Colucci et al., 2020). Such variations of standard data models (profile, extension, and use of generic classes) should be documented in proper metadata and documentation associated with the dataset, including, preferably, the formal encoding and the parameters described in Appendix A. This would enable the automation of the mapping and conversion of compliant datasets. However, most datasets coming from practice do not provide proper explicit documentation, which makes it necessary to analyze the data manually.
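As an illustration of the documentation advocated here, the snippet below sketches a possible machine-readable record of how a (hypothetical) dataset uses a standard data model, covering version, profile, extensions, and generics; all field names and values are only examples and are not part of any standard.

```python
# Illustrative, machine-readable documentation of how a dataset uses a standard
# data model (version, profile, extensions, generics). All names and values are
# examples only.
dataset_model_usage = {
    "reference_standard": "CityGML",
    "version": "2.0",
    "profile": ["Building", "Relief", "Transportation"],   # modules actually used
    "extensions": ["IMGeo ADE"],                           # official extensions, if any
    "generic_attributes": {                                 # generics used instead of
        "bouwjaar": "year of construction (integer)",       # standard attributes
    },
    "georeferencing": {"crs": "EPSG:28992", "storage": "per-geometry srsName"},
}
```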
Additional variability could come from different interpretations of the same data model. A translation, conversion, or even enrichment/new acquisition can be necessary if the interpretation of the data model is too far from the data requirements.
Adopting open standards correctly, even if using different profiles and extensions, would in any case give the advantage of speeding up the mapping that supports the following integrability assessments.

| Define requirements for the integrated data
As when modeling new datasets, to obtain proper data for the desired application and use case, data requirements for the data to be integrated must be defined. Some standards propose guidelines to define data requirements properly; for example, in the building and civil engineering works domain, the concept of Level of Information Need is established by ISO 19650-1:2018 (https://www.iso.org/obp/ui/#iso:std:iso:19650:-1:ed-1:v1:en [Accessed 24th November 2021]) for information stored in BIM. Meanwhile, buildingSMART defines the Information Delivery Specification (IDS) standard to express exchange requirements in a computer-interpretable format, specifying the Level of Information Need and allowing data validation (https://technical.buildingsmart.org/projects/information-delivery-specification-ids/ [Accessed 24th November 2021]). In the geospatial domain, data requirements must also be defined, according to the use case for which they are intended (Malinowski & Zimányi, 2006).
It is essential to define such data requirements also for the information resulting from the integration, considering the mentioned standards and covering all the parameters listed in Section 3.1. They will later guide all the following steps of the integration (data retrieval, information selection, harmonization and processing, and data merging/fusion/combining).
In data requirements, tolerance thresholds can also be set to establish the admitted discrepancy with respect to the defined parameters.
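As an illustration, the snippet below sketches how such requirements, including tolerance thresholds, could be encoded in a machine-readable form; the parameter names, values, and thresholds are purely illustrative assumptions and do not follow any particular standard.

```python
# A hypothetical, machine-readable encoding of data requirements for the integrated
# dataset, covering the parameter groups of Section 3.1. Values are illustrative.
integrated_data_requirements = {
    "scope": "buildings and terrain for urban flood analysis",
    "geographical_extent": {"bbox": [4.35, 51.85, 4.60, 51.99], "crs": "EPSG:4326"},
    "time_frame": {"not_older_than": "2018-01-01"},
    "geometry": {
        "paradigm": "boundary representation (solids)",
        "abstraction_level": "LoD2",
        "accuracy_m": 0.5,            # tolerance threshold
        "crs": "EPSG:28992",
        "unit": "metre",
    },
    "semantics": {
        "classification": "CityGML 2.0 building module",
        "language": "en",
        "required_attributes": ["yearOfConstruction", "function"],
    },
    "structure": {"granularity": "Building / BuildingPart"},
    "syntax": {"format": "CityJSON 1.1"},
}
```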

| Data retrieval
In this article, we will not analyze the issues related to data retrieval (findability, usability, licenses, costs, and so on).
However, this does represent a further aspect to be considered in the integration. The legal and policy constraints must be checked against the possibility to use the data for integration and the data requirements (including foreseen use and publication of the data and so on).
If any technical (e.g., the necessary information is not present or not suitable) or non-technical (e.g., costs, license, etc.) issue is found, a new data retrieval phase is necessary.
T A B L E 1 Synthesis of parameters for data integration potential assessment

Data retrieval includes both existing datasets and possible new acquisitions, if it is not possible to find any suitable existing resource.

| Pre-assessment of data integrability
A preliminary assessment of the effectiveness of the datasets' integration can be performed by comparing initial information about the geographical extent, time frame, and general scope (Figure 6). The scope will later be specified through the geometry, semantics, and structural characteristics described in Section 3.5. The preliminary information could be found in good metadata; otherwise it will be necessary to retrieve it, where possible, by inspecting the data or annexed documentation.
Examples of datasets covering different scopes and semantic coverage are: A-any 3D city model (a) and weather data related to sensors represented in a map (b); B-a 3D city model or map (a) and a BIM (b), or a 2D digital map (a) and a CityGML model containing only buildings, of which there are many examples (b); C-two CityGML models (e.g., Figure 1).
In cases A1, A2, B1, and C1 in Figure 6, it does not make sense to integrate the data, unless complex inference, machine learning, or data mining processing is attempted (e.g., Perumal et al., 2015; Sheeren et al., 2004; Wang, 2017). In A3 and A4 (e.g., one dataset about buildings and one about roads covering different geographical extents which, at least partially, overlap), the overlapping area can be integrated to obtain a richer dataset (i.e., a dataset including both roads and buildings).
In the B2 case (e.g., a GIS and a BIM located in areas bordering each other), data can be integrated to obtain a more extended dataset with the overlapping semantic coverage of the two datasets. In the example of GIS and BIM data, one could obtain a new, more extended dataset probably including less information than the original sources, since, for example, only part of the information in the GIS would be selected, to keep only buildings, and the information in the BIM would be generalized to obtain the same representation present in the GIS. In B3 and B4 (e.g., a GIS and a BIM whose geographical extents overlap), the overlapping area can be integrated to obtain a richer dataset (i.e., with more information) by integrating the information related to different semantic coverage. For example, one can have the information about the building in BIM, such as materials and windows, plus the information about the building in GIS, such as function and address or owner, plus the context elements present in GIS, such as trees and roads. On the other hand, the overlapping semantic coverage of the two datasets (e.g., building) can be used to improve the quality of the data through the integration, by mapping and comparing the data, and selecting the best or most suitable version of the information. A further example could be a map produced for topographic representation of the land and a map produced for running a risk analysis on a specific area, or maps produced by different institutions (e.g., a national or regional authority and a municipality).
In C2 (e.g., the topographical maps of two bordering municipalities), data can be integrated to obtain a similar dataset over a wider extent. Mapping and information quality comparison can help in checking the consistency of the final data (whose quality may be slightly improved or decreased, depending on the details of the harmonization processing).
In C3 and C4 (e.g., the topographical maps provided by the national or regional authority and by a municipality for overlapping areas), in the overlapping areas the two datasets should be quite similar. Still, they could differ in the geometric representation or in data quality, which would make the integration useful to improve the data through mapping and information quality comparison (one can choose to keep the most accurate information, for example). Moreover, the data could be enriched in case they use different profiles of the same schemas, or extensions and generics which complement each other.
In the A3, B3, and C3 cases, inference techniques can be assessed for enriching the data in the non-overlapping area based on the integration processing of the overlapping part. The relationships in the data could be analyzed in the overlapping area, trying to reproduce them in the remaining part starting from the information present. For example, a pattern can be detected between the year of construction of buildings (possibly present in one dataset) and the materials used (possibly present in the other dataset), and the missing information could be inferred for the parts of the datasets which are not overlapping. B1 and C1 could be similar to the B2 and C2 cases, respectively. However, while the two datasets could be successfully harmonized and converted to a common format, the final step of data merging/fusion or combining would not make sense. It must be decided based on the use cases whether the effort is useful. For example, it could make sense to harmonize the data if this allows running the same analysis, using the same tools on the two separate areas (e.g., a flood analysis tool), and possibly comparing the results.
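The following minimal sketch illustrates the logic behind this pre-assessment: it classifies the spatial relation of two bounding boxes and the logical relation of two scopes (sets of represented classes), mirroring the case matrix of Figure 6; the labels and example values are illustrative assumptions.

```python
# A minimal sketch of the pre-assessment of Section 3.4 (Figure 6): classify the
# spatial relation of two bounding boxes and the logical relation of two scopes.

def bbox_relation(a, b):
    """a, b: (xmin, ymin, xmax, ymax). Returns 'disjoint', 'overlapping' or 'equal'."""
    if a == b:
        return "equal"
    disjoint = a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]
    return "disjoint" if disjoint else "overlapping"

def scope_relation(classes_a, classes_b):
    """classes_a, classes_b: sets of represented object classes."""
    if classes_a == classes_b:
        return "equal"
    return "overlapping" if classes_a & classes_b else "disjoint"

# Example: a 2D topographic map vs. a model containing only buildings (a B-type case).
print(bbox_relation((0, 0, 100, 100), (50, 50, 160, 160)))          # overlapping
print(scope_relation({"building", "road", "tree"}, {"building"}))   # overlapping
```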

F I G U R E 6 Overview of the possible data integration based on the spatial and logical relationships of geographical extent and scope/semantic coverage. In the legend, "richer" information indicates a higher quantity of information (e.g., both buildings and trees, where previously only buildings were present); "improved" information indicates a higher quality of the information (e.g., the most accurate information about the same object is selected from the two datasets).
Additional discussion would be necessary about the use of data referring to different time frames. In this case, if the spatial extents overlap at least in part (cases 3 and 4 in Figure 6), as well as the scope and kind of representation (cases B and C in Figure 6), the data could be integrated at least for the overlapping part, setting the priorities and criteria for merging based on the data requirements. The optimal case would be to use data referring to close time frames.

| Detailed assessment of data integrability
It is necessary to run a preliminary data quality assessment with respect to the defined data requirements, to answer the question: "Is it appropriate and effective to integrate the data?" For some of the parameters considered, there are quality thresholds deciding the suitability of the dataset (e.g., the minimum accuracy), while in the other (most) cases, the data can be improved through pre-processing in order to reach the needed quality.
In this case as well, metadata and annexed specifications should be analyzed first, to speed up the process if they report suitable and reliable information.
Tables 2-6 define a rubric to assess the level of integrability of datasets, that is, their compatibility with respect to the prescribed data requirements and the needed pre-processing. Based on the data requirements, the availability of alternative data, the effort required, and the processing options, the user will assess the suitability of the data to be involved in the integration, or whether a different choice of data (including new acquisition and processing) should be preferred. In the tables, scores are given based on the following scale: 0-the data cannot be used for integration; 1-the data must be pre-processed through complex data mining/machine learning/data enrichment processing; 2-the data must be pre-processed to be generalized properly; 3-a conversion must be applied; and 4-the data can be used as they are. Table 2 explains the criteria to assess geometry-related parameters. Regarding "accuracy," on the condition that both datasets respect the minimum accuracy required, there are no studies demonstrating specific issues when integrating datasets having different accuracies. Although it is preferable that the two datasets have similar accuracy, a dataset resulting from two datasets having different accuracies can also be considered valid. However, future studies will investigate the issue in more detail, to identify possible challenges and thresholds in the difference of accuracy between the datasets involved in the integration.

| Geometry integrability assessment
A minimum condition for georeferencing also applies. Georeferencing information, with the minimum accuracy required by data requirements, can be later converted or inferred, but an indication of georeferencing parameters, reference points or at least a qualitative description about the data location is necessary.

| Semantics integrability assessment
Semantics consist of the concepts expressed by object names, as well as by the terms defining attributes and composing code lists. Those should be accompanied by proper definitions, removing possible ambiguity from interpretations.
Considering both terms and definitions while mapping the concepts and objects represented in two different datasets allows a higher reliability in the similarity assessment (Table 3), as does considering their features and semantic neighborhood and context (Kavouras & Kokla, 2007; Rodriguez & Egenhofer, 2003). The reliability of the mapping increases further when considering attributes and instances as well. In the case of geographical datasets, the spatial dimension of instances can also be used to establish correspondence between features (Rodriguez & Egenhofer, 2003).
Terminological-conceptual conflicts can be: "confounding"-information items that deceptively appear to have the same meaning but actually differ (e.g., a "tie-beam" was interpreted as "IfcBeam" in an example reported by Noardo et al., 2022); "scaling"-deriving from the use of different reference systems and scales; "naming"-deriving from the use of different terms (homonyms and synonyms) (Kavouras & Kokla, 2007; Wache et al., 2001).

T A B L E 2 Level of integrability based on the geometry-related parameters

T A B L E 3 Different combinations of Term (T) and Definition (D) cases (Kavouras & Kokla, 2007)

The possibility to convert between different semantic paradigms, including between different calculation methods and filling criteria for attributes, should be assessed for the specific cases: for example, calculation could be based on further data or values that must be known. The condition is that the semantic paradigm, filling criteria, and methods used are well documented within metadata.
The semantic paradigm must be compatible with the data requirements. For example, Noardo et al. (2016) report differences in how the road classification is filled in Italian and French digital maps, being "paved"/"unpaved" in the Italian maps and a hierarchy of functions in the French maps. In this case, it is hard to infer or calculate the values from the available data, and a third source of information is likely necessary. Table 5 defines the criteria to assess the integrability of data based on the structural parameters. Table 6 defines the criteria to assess the integrability of data based on the syntactical parameters.

| The final overall assessment
The integrability potential of the dataset can be roughly measured by summing up the scores related to all the parameters considered and reported by the data requirements. If any of the parameters scored 0, the integration cannot be performed and the process should be stopped. The maximum integrability score is: number of involved parameters × 4 (i.e., all the parameters scored 4 and the dataset can be merged as is), while the minimum acceptable score equals the number of parameters (all the parameters scored at least 1, i.e., the dataset can be used after an enrichment that is possible). However, this is only a rough assessment, and the processing to harmonize the dataset with respect to each parameter needs to be considered singularly.
Moreover, it should be noted that the assessment can regard only the part of the dataset which is intended to be used in the integration. On the other hand, some of the parameters can be irrelevant for the data requirements (e.g., there are no attributes or code lists). Checking that all the information prescribed in the data requirements is covered by the datasets involved in the integration must be done separately, in the initial phase of data retrieval and/or later, during the validation step.
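A minimal sketch of this overall assessment is given below: per-parameter scores (0-4, as in Tables 2-6) are summed, and the process is blocked if any parameter scored 0; the parameter names and values are illustrative.

```python
# A simple sketch of the overall assessment of Section 3.5: sum the per-parameter
# scores (0-4) and block the process if any parameter scored 0.

def overall_integrability(scores):
    if any(s == 0 for s in scores.values()):
        return None  # integration cannot be performed with this dataset
    return sum(scores.values()), len(scores) * 4  # (achieved score, maximum score)

scores = {"accuracy": 4, "abstraction_level": 2, "georeferencing": 3,
          "terms": 3, "granularity": 2, "data_format": 3}
print(overall_integrability(scores))  # (17, 24): usable after harmonization
```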

| Define the needed harmonization actions
Once the datasets involved are assessed as suitable for the integration (integrability scores 1 to 3 as defined in Section 3.5), a pre-processing must harmonize their characteristics with the ones indicated by the data requirements (to reach integrability score 4). In this section, the needed actions are listed, together with possible methods, referring to each parameter and integrability score case. Section 3.6.1 introduces the semantic and structure mapping, as defined within the ontology engineering field. It is a preliminary step to understanding the approach to processing and harmonizing the data suggested in this article.
3.6.1 | Ontologies and data models mapping and integration

"Schema mapping," or "schema matching" (also implying semantics), is the definition of an automated transformation of each instance of a data structure A into an instance of a data structure B that preserves the intended meaning of the original information (Doerr, 2004; Morocho et al., 2003). The preservation of the intended meaning must be ultimately judged by the application domain expert. Doerr (2001) defines the principles for mapping thesaurus terms by means of concept-based mapping. Mapping and ontology integration techniques are the tools necessary to solve most of the inhomogeneities in semantics, in the choice of "Terms," and in the structure. For this reason, although each parameter is considered as a separate issue in this article, so as not to neglect any of them, a mapping between the classification followed by the input data and the one defined by the data requirements is the preliminary step to guide the following processing to harmonize the semantics and structure features.
The techniques and methodologies for mapping different data models or ontologies consider schematic and semantic differences, including syntactic and semiotic/pragmatic heterogeneities in some cases (Kavouras & Kokla, 2007). There are several approaches to integrate ontologies, many of which are based on inter-ontology mapping and alignments between multiple ontologies (Wache et al., 2001).
According to Kavouras and Kokla (2007), ontology integration approaches can be defined according to three dimensions: (D1) the possible change/alteration/distortion caused; (D2) the number of ontologies resulting from the integration process; (D3) the use of a target ontology in the integration process. For the scope of this article, an integration involving possible change/alteration/distortion is admitted (D1), only one data model will result from the integration (D2), and the ontology or data model defined in the data requirements will be used as the target (D3). The destination data model or ontology could correspond with one of the involved datasets schemas or constitute a third one defined by data requirements. A hybrid approach (Wache et al., 2001) is recommended (Figure 7).
Ontology integration consists of (the iteration of) the following steps (Klein, 2001; McGuinness et al., 2000):

1. Matching, that is, the procedure that compares concepts and matches those that are more similar in meaning according to a given context (Kavouras & Kokla, 2007; Klein, 2001), or finds where the two conceptualizations overlap (Klein, 2001; McGuinness et al., 2000);
2. Alignment, to be generated accordingly (Osman et al., 2021), that is, the mapping of correspondences (equivalence or subsumption relations) between concepts in two or more ontologies into mutual agreement (Kavouras & Kokla, 2007), making them consistent and coherent (Klein, 2001), to overcome syntactic (representation format), terminological (naming differences), conceptual (coverage, granularity, and perspective), and semiotic/pragmatic (how the ontology is interpreted or used by communities with respect to a context) heterogeneities (Wache et al., 2001);
3. Merging or integrating ontologies, defined as the creation of a new ontology from two or more existing ontologies with overlapping parts, which can be either virtual or physical (Klein, 2001). Osman et al. (2021) define and analyze in more detail the types of processes that can be involved in the above mentioned cases, as a useful reference for semantic integration processes;
4. Checking the consistency, coherency, and non-redundancy of the result.
For the aims of this article, the generation of a new ontology (step 3) is not of interest, but rather, the mapping will be useful to determine the selection, interpretation and processing to convert the semantics and structure of the input datasets into the ones defined within the data requirements.
3.6.2 | Harmonization of the geometry

With reference to Table 2, the needed harmonization actions with respect to each geometry-related parameter are described in this section.

| Accuracy
No pre-processing is needed to adjust the accuracy, since it is a simple threshold value (either sufficient or not).
In case the accuracy of the dataset is not homogeneous, it should be verified that the relevant contained information, which is necessary for the integration, respects the established accuracy threshold.
Depending on the use cases, it can be assessed case by case whether it is necessary to add acquisition uncertainty to possibly too accurate data (Biljecki et al., 2015). However, there is no evidence, at the moment, showing that excessively accurate data could bring issues to the integration process.
Similarly, there is no research so far showing a reliable method to improve data accuracy without new acquisitions. Some studies propose to improve accuracy by referring to a reliable existing dataset (Noskov & Doytsher, 2017).
It could be considered case by case whether a similar approach could be applied and what it entails in 3D.

| Abstraction level
Either enrichment or generalization can be necessary.
1-enrichment) Enrichment is the process of adding missing details to the data in order to increase the amount of information contained. It could use simple processing derived from information contained in the data themselves.
For example, building footprints could be extruded up to the height values stored as attributes in the same dataset.
Another case foresees the use of external sources to get the necessary information to generate higher-level-of-detail models, for example, extruding building footprints using digital terrain models and height point clouds as references.

2-generalization) In the case of generalization of 3D information systems, including 3D city models, BIM, and similar ones, two operations must be considered, that is, feature extraction, selecting the objects and features to be considered to compute the generalized geometry, and the generalization itself (Guercke & Brenner, 2009). The phase of feature extraction can be based either on geometry (e.g., topological relationships, distance-based or bounding-box criteria) or semantics (e.g., including or excluding specific classes of objects, or selecting objects based on attributes, such as the "isExternal" attribute in IFC).
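The following is a minimal, pure-Python sketch of the extrusion-based enrichment mentioned above (a footprint raised to a block model using a height attribute); a real pipeline would rely on GIS or ETL tooling, and the geometry handling here is deliberately simplified.

```python
# A minimal sketch of footprint extrusion to a block (LoD1-like) model using a height
# value stored as an attribute of the same dataset. Illustrative only.

def extrude_footprint(footprint, height):
    """footprint: list of (x, y) ring vertices (open or closed); height: metres."""
    ring = footprint if footprint[0] == footprint[-1] else footprint + [footprint[0]]
    floor = [(x, y, 0.0) for x, y in ring]          # footprint at ground level
    roof = [(x, y, height) for x, y in ring]        # same ring lifted to the height
    walls = [[floor[i], floor[i + 1], roof[i + 1], roof[i]]
             for i in range(len(ring) - 1)]          # one quad per footprint edge
    return {"floor": floor, "roof": roof, "walls": walls}

building = {"footprint": [(0, 0), (10, 0), (10, 8), (0, 8)], "height_attr": 6.5}
solid = extrude_footprint(building["footprint"], building["height_attr"])
print(len(solid["walls"]), "wall faces")  # 4
```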

| Geometric paradigm
The harmonization of the geometric paradigm is done via conversion.

3-conversion)
In this case, the geometries must be converted into a different representation, as defined by data requirements, for example, by means of Extract-Transform-and-Load (ETL) tools (Noardo, Harrie, et al., 2020).

| Topology
Data requirements should specify the kind of topological information needed and the kind of storage of such information. Moreover, a validation procedure should be indicated or developed to check that topology is correctly used both in the storage of geometries and in the reciprocal objects' relationships.
1-enrichment) Data mining, machine learning, and inference techniques can be used to infer and store a richer topology. Manual techniques could be considered as well in some cases.

3-conversion)
The kind of storage of such relationships (whether implicit or explicitly stored in the models) must also be maintained and made compliant with the data requirements.

| Georeferencing
It implies the Coordinate Reference System (CRS) for planar coordinates and heights, accuracy of the georeferencing, and kind of storage of georeferencing parameters.
1-enrichment) Techniques to infer the correct georeferencing from too vague data (e.g., the address) can be considered. Options include comparing the model to a different spatial representation (e.g., a BIM model to the parcel where it is supposed to lie, or to the representation of the model in its context in a non-interoperable format, such as PDFs) or manual positioning based on qualitative information and description. Other inference techniques can be assessed in order to automate the processing (e.g., Hiebel et al., 2017).

3-conversion)
Re-project to the coordinate and height system prescribed by the data requirements. Techniques can vary based on the kind of data being georeferenced (Jaud et al., 2020; Noardo et al., 2016; Uggla & Horemuz, 2018).
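As a sketch of this conversion step, the snippet below re-projects planar coordinates with the pyproj library (assumed to be available); the source and target CRSs are arbitrary examples of what data requirements might prescribe, and height transformation and BIM-specific parameters (base point, rotation) are not covered.

```python
# A sketch of the re-projection step (3-conversion), assuming pyproj is available.
# EPSG:4326 (WGS 84) and EPSG:28992 (Dutch RD New) are example CRSs only.
from pyproj import Transformer

transformer = Transformer.from_crs("EPSG:4326", "EPSG:28992", always_xy=True)

def reproject(points_lonlat):
    """points_lonlat: iterable of (lon, lat) tuples; returns projected (x, y) tuples."""
    return [transformer.transform(lon, lat) for lon, lat in points_lonlat]

print(reproject([(4.478, 51.924)]))  # a point roughly in the centre of Rotterdam
```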

| Unit of measure
A simple conversion can harmonize the unit of measure used.

3-conversion)
The geometries should be scaled to the unit of measure needed by the data requirements. ETL tools as well as other modeling software allow this.

| Harmonization of the semantics
With reference to Table 4, the needed harmonization actions with respect to each semantics-related parameter are described in this section.

| Terms and definitions
For the three cases of enrichment (1), generalization (2), and conversion (3), a mapping (Section 3.6.1) must be applied after having measured and analyzed similarity.
Mapping can be done automatically, semiautomatically, or manually. Machine learning techniques can also be used (e.g., Doan et al., 2004).
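The toy example below illustrates a semi-automatic term mapping: matches already agreed upon are taken from a curated table, while the remaining terms receive a suggestion based on simple string similarity, to be confirmed by a domain expert; the term lists and the similarity-based ranking are illustrative only.

```python
# A toy illustration of (semi)automatic term mapping: curated matches are reused,
# the rest are proposed by string similarity and left for expert confirmation.
from difflib import SequenceMatcher

curated = {"IfcRoof": "RoofSurface"}  # mappings already agreed upon
source_terms = ["IfcRoof", "IfcWallStandardCase", "IfcSlab"]
target_terms = ["RoofSurface", "WallSurface", "GroundSurface", "FloorSurface"]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for s in source_terms:
    if s in curated:
        print(s, "->", curated[s], "(curated)")
        continue
    best = max(target_terms, key=lambda t: similarity(s, t))
    print(s, "->", best, f"(suggested, similarity {similarity(s, best):.2f}; expert check needed)")
```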

| Vagueness
As for the geometric accuracy case, the semantic data should be no vaguer than what is admitted by the data requirements. It is not possible to enrich the data, because, even if adding more detail starting from the data themselves, the original vagueness would be propagated without reaching the objective.

2-generalization)
In the case of more accurate data, no processing needs to be considered. In some cases, a vaguer value may be necessary, for example, a classification into "high," "medium," and "low," rather than the exact measurement. In such cases, processing should be applied (and tracked in metadata) to compute and classify the values.
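A minimal sketch of such a generalization is shown below: exact measurements are re-classified into the vaguer classes required by the destination model; the class boundaries are illustrative and would be fixed by the data requirements and tracked in metadata.

```python
# A minimal sketch of generalizing exact values into vaguer classes.
# The thresholds are illustrative assumptions, not prescribed values.
def classify_height(height_m):
    if height_m < 10:
        return "low"
    if height_m < 25:
        return "medium"
    return "high"

print([classify_height(h) for h in (6.5, 18.0, 40.2)])  # ['low', 'medium', 'high']
```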

| Approximation
As for the case of geometric abstraction level, either enrichment or generalization can be necessary.
1-enrichment) In case of too vague data, it could be possible to enrich semantics using several techniques, such as ontology-based inferences and machine learning techniques (e.g., Bloch & Sacks, 2018; Dou et al., 2015; Lüscher et al., 2007; Xue et al., 2021; Werbrouck et al., 2020). Attention should be paid to the vagueness value so that it is not affected by this processing.
2-generalization) If more generalized entities are necessary, superclasses can be used (either from the classification used by the data or by mapping the data to other classifications, for example adopting a different perspective). The same techniques as for data enrichment could be used, with the different objective of detecting superclasses, in case of differences in the semantic paradigm.

| Semantic paradigm
As for geometry, a conversion can harmonize the paradigm.

3-conversion) The criteria used in defining the conceptualization and filling the attributes have consequences on
the resulting meaning of the entities and attribute values, and need to be made homogeneous. If they are filled according to a completely different perspective, objects and attributes must be recalculated or adapted. A transformation, defined through a mapping to a different conceptualization, must be applied, that is, changing the semantics slightly (possibly also changing the representation) to make it suitable for purposes other than the original one (Klein, 2001).
T A B L E 4 Level of integrability based on the semantics-related parameters

To make a simple example, the mapping of IFC classes to CityGML classes would produce the conversion of IfcRoof to bldg:RoofSurface and of IfcWall with attribute "External" to bldg:WallSurface. The concepts are slightly different in the two models, although indicating similar objects, since IFC is intended for construction purposes and CityGML for city analysis goals. Such a distortion needs to be documented and tracked (Kavouras & Kokla, 2007). Moreover, the units of measurement for attributes' values must be converted to comply with the data requirements prescriptions.
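The sketch below expresses this kind of mapping as a small, conditional lookup from IFC entities to CityGML boundary-surface classes; it is an illustrative fragment, not a complete or authoritative IFC-CityGML mapping.

```python
# An illustrative (incomplete) mapping table from IFC entities to CityGML boundary
# surfaces, with a condition on the externality of walls.
def map_ifc_to_citygml(ifc_class, is_external=None):
    if ifc_class == "IfcRoof":
        return "bldg:RoofSurface"
    if ifc_class in ("IfcWall", "IfcWallStandardCase"):
        return "bldg:WallSurface" if is_external else "bldg:InteriorWallSurface"
    if ifc_class == "IfcSlab":
        return "bldg:GroundSurface" if is_external else None  # needs expert decision
    return None  # not covered: flag for manual mapping

print(map_ifc_to_citygml("IfcWall", is_external=True))   # bldg:WallSurface
print(map_ifc_to_citygml("IfcRoof"))                      # bldg:RoofSurface
```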

| Language
A language translation (3-conversion) must be applied according to the needs of the application as reported by data requirements. Multilingual thesauri can be used whenever they are available (Doerr, 2001). The buildingSMART Data Dictionary (http://bsdd.buildingsmart.org [Accessed 30th November, 2021]) is an example.

| Encoding
A conversion (3) is necessary in this case as well.

| Harmonization of the structure
With reference to Table 5, the needed harmonization actions with respect to each structure-related parameter are described in this section.

| Granularity
Enrichment must be applied in case more detail is needed in the conceptualization, or generalization, in case higher level concepts are necessary.
1-enrichment) After the mapping, inference techniques should be applied to specify the objects to a finer granularity (e.g., some "constructions" will become "buildings").
2-generalization) After the mapping, generalization techniques can be applied to generalize the object's conceptualizations, for example, by using superclasses in the reference classification (e.g., "buildings" and "infrastructures" will become "constructions").

| Semantic paradigm
After the mapping, the representation must be expressed according to the semantic paradigm defined in the data requirements (3-conversion).

T A B L E 5 Level of integrability based on the structure-related parameters

| Relationships
They can be enriched, generalized, or simply converted.

1-enrichment) Relationships between objects can be inferred through the previously mentioned inference techniques, if the needed information is stored within the dataset. For example, the apartments represented in an IFC file can be detected starting from the mutual relationships (either semantic or geometric) of the building elements, and the result can be stored back in the IFC file according to the proper entity (van der Vaart, 2022).

2-generalization)
The mapping can support the generalization of relationships, guided by the destination conceptualization. For example, the hierarchical relationship between "building part" and "city object" may be needed, while the relationship between "building part" and "building" is not of interest. Therefore, the latter could be removed from the dataset or, better, translated to a relationship with "city object."

3-conversion) The mapping must also cover the relationships between entities, and the input dataset must be converted accordingly. This is the case, for example, in which the relationship is the same but is named differently, such as "lives in" in one dataset and "is resident" in another one.

| Harmonization of the syntax
With reference to Table 6, the needed harmonization actions with respect to each syntax-related parameter are described in this section.

| Data format
Encoding language and the representation formalism, including version, must be considered.

3-conversion) ETL tools, which can also be embedded into GIS tools or other models' exporters, can generally apply the conversion to the desired data formats. In the ontology engineering field, this is called "translation": changing the representation formalism of an ontology while preserving the semantics (Kavouras & Kokla, 2007; Klein, 2001).
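A toy sketch of such a translation is given below: the same feature is re-encoded from a JSON structure to a GML-like XML element while its semantics are preserved. It is not a standards-compliant CityJSON/CityGML converter; the names and structure are illustrative.

```python
# Toy format "translation": one feature re-encoded from JSON to a GML-like XML
# element, preserving its semantics (illustrative, not standards-compliant).
import json
import xml.etree.ElementTree as ET

feature = json.loads('{"id": "bldg_1", "class": "Building", "measuredHeight": 12.5}')

elem = ET.Element(feature["class"], attrib={"id": feature["id"]})
height = ET.SubElement(elem, "measuredHeight", attrib={"uom": "m"})
height.text = str(feature["measuredHeight"])

print(ET.tostring(elem, encoding="unicode"))
# <Building id="bldg_1"><measuredHeight uom="m">12.5</measuredHeight></Building>
```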

| Objects' behavior
The mapping will be the reference tool to guide this processing as well in the three cases of enrichment (1), generalization (2) and/or conversion (3).

| Final data fusion and validation
Although in some studies the term "data fusion" is used to indicate the overall integration, here it is intended to represent specifically the merging of geometric data overlapping on the same extent or border, by resolving the conflicts remaining after the harmonization, which are due to differences in the objects represented by the two sources: for example, discrepancies in DTM heights in certain areas, or in the shape or presence of buildings, and so on.
First, priorities should be decided to choose the most reliable data to be maintained in the integrated data after data fusion. Such priorities should be: (1) time (most recent source should be trusted in case of discrepancies); (2) quality (most accurate, less vague source should be preferred, as well as the closest source to data requirements prescriptions); (3) interest (the source representing the objects of interest, if these are not present in both). Those criteria should be reassessed based on the specific data requirements.
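A minimal sketch of such priority-based conflict resolution is given below; the candidate attributes (date, accuracy, interest flag) are assumptions about how the sources could be described.

```python
# Sketch of priority-based conflict resolution during data fusion: among
# conflicting representations of the same object, prefer the most recent,
# then the most accurate, then the source of interest (assumed attributes).
from datetime import date

def resolve(candidates):
    return sorted(
        candidates,
        key=lambda c: (c["date"], -c["accuracy_m"], c["of_interest"]),
        reverse=True,
    )[0]

building_candidates = [
    {"source": "3D city model (2015)", "date": date(2015, 3, 1),
     "accuracy_m": 0.5, "of_interest": False},
    {"source": "IFC design model (2021)", "date": date(2021, 6, 1),
     "accuracy_m": 0.01, "of_interest": True},
]
print(resolve(building_candidates)["source"])  # IFC design model (2021)
```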
More sophisticated processing could include the editing of 3D geometries or the addition of 3D details to obtain a richer dataset from the merging of two different representations; for example, it might be useful to add specific building details to an already well-formed 3D city model. This should precede the mapping of features or objects, possible (parametric) modeling phases, or other kinds of merging, depending on the manipulated objects and the kind of representation. Further studies, not yet performed, will be necessary to investigate the issue in more detail.
The methodology to perform the final merging can be chosen among different options, based on the kind of data considered and the remaining discrepancies. In the final integrated dataset, the two origin datasets should, as far as possible, no longer be recognizable; therefore, holes and discrepancies in heights must be smoothed. The maximum tolerance for such discrepancies can be taken as the geometric accuracy established by the data requirements.
The integrated data obtained should be finally validated against the data requirements. Although it is not covered in detail in this article, validation is an essential phase of the integration, since it allows the assessment of the integration success and outlines whether the obtained data can be effectively used for the defined use case or not.
If the data requirements were expressed in a formal, machine-readable language, an automatic validator could be programmed to read the customized data requirements and check the data against them. However, at present it would be hard to automate the full validation process, even though the single aspects of the data (e.g., geometry, structure, and so on) can most likely be validated separately, possibly with automatic validators (http://geovalidation.bk.tudelft.nl [Accessed 1st December 2021]), provided they are quantitative or formalized.
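A minimal sketch of what such a requirements-driven check could look like is given below, assuming the requirements and the dataset metadata are expressed as simple key-value structures; the parameter names and values (e.g., EPSG:28992) are illustrative assumptions.

```python
# Sketch of a partial automatic validation of an integrated dataset against
# machine-readable data requirements (only quantitative/formalized aspects).
requirements = {"crs": "EPSG:28992", "max_positional_error_m": 0.2, "lod": "1"}

def validate(metadata: dict, req: dict) -> dict:
    return {
        "crs": metadata.get("crs") == req["crs"],
        "accuracy": metadata.get("positional_error_m", float("inf"))
                    <= req["max_positional_error_m"],
        "lod": metadata.get("lod") == req["lod"],
    }

report = validate({"crs": "EPSG:28992", "positional_error_m": 0.15, "lod": "1"},
                  requirements)
print(report)  # {'crs': True, 'accuracy': True, 'lod': True}
```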

| A CASE STUDY TO EXEMPLIFY THE PROPOSED APPROACH
A case study was chosen to exemplify and iteratively design the approach proposed in this article. The goal of the integration is the update of a 3D city model (CityGML LoD0 and LoD1) (Figure 8) with the data coming from a BIM (IFC) representing the architectural model of a designed building, the so-called "Terraced tower" (Figure 9), as likely delivered for the digital building permitting procedure. It is a rather simple case, intended to show how the proposed framework can be used for data integration in practice.

| Data requirements definition and integrability assessment
The data requirements, in this case, correspond with the characteristics of the destination model, that is, the LoD1 CityGML model of Rotterdam. In Tables A1-A5, in Appendix B, each parameter is analyzed for the destination model (data requirements) and for the input model, that is, the IFC model of the Terraced tower building. In the last column, the integrability (and, consequently, the needed harmonization processing) is assessed and commented on according to the framework proposed in Section 3.5.
None of the scores given is 0, meaning that the two datasets can be integrated. Moreover, the minimum score is 2, so that the necessary processing is generalization for some parameters, while others can be used as is or only converted to the destination format.

| Processing of the IFC model towards harmonization
As pointed out by the specific assessment (Table A1), the geometry needs to be generalized, converted into meters from millimeters and stored into a different format.
To do this, the IFC geometry was processed to extract the footprint of the building and the maximum height.
ETL tools can be used. In this case, the GEOBIM_Tool (https://github.com/twut/GEOBIM_Tool [Accessed 2nd December 2021]), developed for a project on the digitalization of the building permitting procedure in Rotterdam (https://3d.bk.tudelft.nl/projects/rotterdamgeobim_bp/ [Accessed 2nd December 2021]), was used to extract the footprint from the IFC model and to measure the building's maximum height. The tool automatically scales the model to meters.
The footprint was used to generate: the footprint polygon (lod0FootPrint) at the ground level; the roof edge polygon (lod0RoofEdge), generated by storing the same polygon at the height of the maximum height measured from the IFC model; an extruded solid representing the 3D building (lod1Solid). This processing embodies the conversion step required.
The extrusion can be generated in any GIS or ETL processing tool from the footprint and maximum height of the building, using a similar approach as the one used for modeling the 3Dfier Rotterdam model. Finally, the geometries can be exported to GML.
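A minimal sketch of this extrusion step is shown below: from a footprint ring and the measured maximum height, the lod0FootPrint, lod0RoofEdge, and the boundary surfaces of the lod1Solid are derived. The coordinates are illustrative, and details required for a valid solid (e.g., consistent surface orientation) are not handled.

```python
# Sketch of the LoD1 extrusion: derive lod0FootPrint, lod0RoofEdge and the
# boundary surfaces of the lod1Solid from a footprint and a maximum height.
footprint = [(0.0, 0.0), (10.0, 0.0), (10.0, 8.0), (0.0, 8.0)]  # illustrative ring
ground_z, max_height = 0.0, 25.3

lod0_footprint = [(x, y, ground_z) for x, y in footprint]
lod0_roof_edge = [(x, y, ground_z + max_height) for x, y in footprint]

# One vertical quad per footprint edge (surface orientation not handled here).
walls = []
for (x1, y1), (x2, y2) in zip(footprint, footprint[1:] + footprint[:1]):
    walls.append([(x1, y1, ground_z), (x2, y2, ground_z),
                  (x2, y2, ground_z + max_height), (x1, y1, ground_z + max_height)])

lod1_solid_boundaries = [lod0_footprint, lod0_roof_edge] + walls
```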
Due to the chosen approach for the processing, the IFC semantics can be useful to select the objects that need to be considered in the processing. The objects which are not part of the IfcBuilding, such as the parts of the model belonging to the site, outside the building, need to be excluded from the geometric processing. The conversion necessary in this case must consider the entire representation paradigm (both geometrical and semantic/structural): we need to be aware that, in the IFC model, IfcBuilding is a class that groups other building elements represented as objects with their own geometry, while in the CityGML model it stores only one object with its own geometry(ies).
In the storage of attributes and attribute values, there is no need for conversion, since, in the mapping, there is no attribute in IFC storing the maximum height of the building. However, the parameter measured from the IFC geometries is stored in the correct place to match the destination data encoding and syntax.

| Final steps: Data fusion, validation, and metadata update
The data need to be merged. Assessing and resolving the conflicts is quite simple in this case. The data coming from the processing of the IFC model are substituted into the outdated 3D city model, to generate an up-to-date version of it. The bordering objects (e.g., roads, land cover) need to be modified to obtain a watertight model (Figure 10).
Versioning techniques need to be considered for keeping track of the integration as a change in the data. Since the resulting dataset is CityGML compliant, it can be validated with the online validator val3dity (https://github.com/tudelft3d/val3dity [Accessed 6th June 2022]). A similar approach could also be used in simpler cases, for example, to update 2D digital maps (e.g., Figure 11).
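As a sketch, the validation could also be run locally by invoking val3dity on the exported file, assuming the executable is installed; the exact command-line options and report handling may differ between versions, and the file name is illustrative.

```python
# Sketch: invoking the val3dity validator on the exported CityGML file
# (assumes the val3dity executable is installed and on the PATH).
import subprocess

result = subprocess.run(["val3dity", "rotterdam_updated.gml"],
                        capture_output=True, text=True)
print(result.stdout)  # summary of the geometric validity of the primitives
```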

| DISCUSSION
The proposed methodology supports a point-by-point analysis to obtain an actual integration of datasets with respect to use case data requirements. The proposed framework joins the efforts made within different fields, such as ontology engineering, 3D survey and 3D city modeling.
Due to the high complexity of the issue, it is not possible to provide a single solution; instead, an overall framework and methodology is proposed, which was not previously available in the literature as a comprehensive workflow and reference, since earlier efforts focused on single aspects. Some options for managing and processing the different aspects are suggested for each case, as proposed in the literature. The needed and available options to process the data with respect to each outlined parameter must be investigated in detail for each case, since the range of existing data is too heterogeneous and the use case requirements may be very specific, which makes any suggestion of specific processing ineffective in general. However, the proposed framework can be used for any kind of integration, representing a solid reference for future work.
FIGURE 10 Processing followed for the integration (harmonization + data merging) as planned according to the initial assessment.

| The automation of the methodology
The integration framework, as defined in this article, is very complex, including qualitative assessments in some cases.
Therefore, it is hard to propose a fully automatic procedure to integrate datasets. At this moment, the main aim of this approach is to have a transparent and clear methodology for effective data integration that can be reused in the future and at the same time highlight possible required improvements of individual data sources.
An automated workflow would consist of many components, in order to process the data with respect to each parameter, which must first be measured and assessed. The level of possible automation would increase with the metadata's quality (they should be correct and specific), completeness (all the parameters should be well documented and explained), and the formality of the storage language (they should be machine-readable).
This would allow an algorithm to choose or suggest the harmonization processing for each parameter and the final data merging. However, at the moment, considering the available data (and metadata) from practice, automatic procedures can only be chosen to support the pre-processing harmonization steps for the single parameters. Moreover, some automatic or computer-assisted procedures can be used to extract the needed parameters from the data even if these are not properly documented (e.g., Kavouras & Kokla, 2002). A manual or semi-manual guided analysis is still the preferable choice.
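A minimal sketch of how the assessed integrability scores could drive such a (semi-)automated choice of harmonization steps is shown below; the parameter names and actions are placeholders for the processing discussed in the previous sections.

```python
# Sketch: the integrability score (0-4) assessed for each parameter drives the
# selection of the harmonization action in a (semi-)automated workflow.
ACTIONS = {0: "stop: not usable", 1: "enrich", 2: "generalize",
           3: "convert", 4: "use as is"}

assessment = {"level of detail": 2, "unit of measure": 3,
              "data format": 3, "terms": 4}  # illustrative scores

plan = {parameter: ACTIONS[score] for parameter, score in assessment.items()}
for parameter, action in plan.items():
    print(f"{parameter}: {action}")
```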

| An overall workflow
The group of actions needed to convert the data into the integrated dataset can be implemented in different workflows. The kind of implementation and tools have to be chosen case by case, according to the complexity of the conversion, the level of possible automation and the need for repeating the process several times. They range from manual workflows, in which the actions intended to solve each of the detected inhomogeneities are launched manually step by step, to completely automated workflows, implemented in ETL tools, such as the Safe Software FME (https://www.safe.com [Accessed 3rd December 2021]), in software like GIS processing models, or in other bespoke implementations.
Currently, many studies developing data converters, for example between BIM and GIS, already consider several of the listed aspects together; most frequently, the data format and data schema/semantics are taken into account. Few examples are available of the complete workflow from data assessment to pre-processing and harmonization and the final data merging into a real integration.

FIGURE 11 Example of the use of extracted data to update a 2D digital map (a, before; b, after the update), following the proposed workflow.

| Discussion about metadata
In order to allow a fast assessment, suitable metadata should contain up-to-date and specific information about all the mentioned points. However, many of them are not foreseen in the current standard metadata schemas.
CityGML is not included in Table 7 because it has limited metadata support (i.e., only the name of the dataset, bounding box, and coordinate system), mostly optional and inherited from the GML encoding format. In some cases, more information is added as comments in the XML file, but there is no control and no shared prescription about what information must be provided and in what format (Labetski et al., 2018). Metadata support was not included in version 3 of CityGML either.
Similarly, many IFC files can contain some additional metadata information in a commented section in the STEP file in which they are encoded (e.g., author, schema, creation date).
The HEADER of the STEP files can also host some additional information (e.g., description, time_stamp, author, originating_system, FILE_SCHEMA). However, they do not follow a specific prescription and are related to the generation of the STEP file, according to the specific implementation choices of software exporting it.
In Table 7, only the metadata regarding the technical details of the data are considered, while others, related to non-technical aspects (licensing, use and retrieval of data, publication details) are not reported.
As already pointed out by Labetski et al. (2018), many attributes about the datasets, which could be relevant for their correct interpretation and (re)use, as well as for their integration, are currently missing in the official metadata schemas. In some cases (e.g., INSPIRE), they link to external specification documents, but without guaranteeing or prescribing anything about the information contained there.
In the table, we can notice how only a minimal part of the useful attributes is covered by metadata schemas.

| CONCLUSIONS
The topic of multisource spatial data integration is relevant to many use case applications, such as GeoBIM, digital twins, governance digitalization, and land analysis. GeoBIM is the specific integration of geoinformation with BIM and can, in turn, serve several use cases (digital building permits, map updating, asset and facility management, etc.); other examples are analyses combining terrain, functions, traffic data, and noise barrier parameters, and road infrastructure maintenance analysis (geoinformation, traffic data, and transport infrastructure details). Data integration allows saving the resources needed to generate new data by reusing existing datasets, as well as enabling the automation of several tasks. However, the high level of complexity of 3D information systems, such as BIM or 3D city models, and their, even conceptual, distance from each other make the integration workflows hard. As a consequence, the integration attempts often remain partial. This article provides a reference for spatial data integration to support use case applications, by proposing a comprehensive workflow and methodology considering in detail all the data parameters involved in the integration: geometry, semantics, structure, and syntax. By following the provided methodology, a proper and consistent harmonization and merging processing can be planned, to obtain integrated datasets usable in practice.

| Contents and procedures
Because of the complexity of the involved components, this study could not provide one final solution for each of the parameters and the workflow steps. Moreover, the software tools to process the data are continuously evolving, as well as the modes to edit the models. Therefore, mentioning specific solutions would have been reductive with respect to the range of possibilities available now and in the future.
This article outlines a workflow and framework guiding a suitable multisource data integration by considering the needs of use cases applications as well as the usual characteristics of the datasets that can be found in practice.
Following the described workflow will allow obtaining concretely re-usable data by means of a systematic approach, without neglecting any of the features defining the data, and therefore avoiding issues when re-using them within applications.
Future work will be directed at testing in more detail each variable detected in the assessment matrices and testing the methodology with more cases and more complex data.

ACKNOWLEDGMENTS
This study has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Leading Fellow Postdoc program, grant agreement no. 707404, project "Multisource spatial data integration for smart city applications." I would like to thank the 3D geoinformation group at TU Delft, who gave great support and inspiration to my research.

CONFLICTS OF INTEREST
There are no conflicts of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in Zenodo at https://zenodo.org/record/5786657#.
A.1 | The geometric parameters
The level of abstraction corresponds to the concept of cartographic generalization for traditional maps and can be seen as a joint concept of scale and resolution in traditional cartography (Laurini & Thompson, 1992). It allows representing the objects on the map by applying the appropriate selection, simplification, symbolization, and classification for them to be understandable for a specific scale and purpose (Duchêne et al., 2014; Gaffuri, 2011; Stoter et al., 2014; Worboys & Duckham, 2004). In the case of 3D models, the Level of Detail (LoD) concept applies, first defined by CityGML (Biljecki et al., 2016; Gröger & Plümer, 2012). Other kinds of simplification, such as the Level of Development used in Building Information Modelling (Latiffi et al., 2015), are not relevant in this context, since they do not represent an abstraction from a model most faithful to reality but indicate instead the stage reached along the path of design and improvement. It should therefore be considered in the data retrieval phase, to assess whether the data are suitable for integration; in the case of BIM, the reference from which to abstract more generalized representations should usually be the final design or the as-built BIM.
The same geometry can be represented, modeled, and stored following several alternatives or "paradigms" (e.g., raster or vector representations) (Ohori et al., 2018; Ledoux, 2018). The applications using the geometry for analysis or further processing (i.e., not only for visualization) need specific input. Therefore, depending on the use cases, and as defined accordingly in the data requirements, the data to be integrated must use the same kind of representation and storage of the geometry, in order to be suitably recognized and used properly.
Topology can be within one object, as part of the storage and representation of geometry, and between two objects, as spatial relationships (Ledoux, 2018). These characteristics can have an influence on the use for which the models are intended, as well as other constraints (e.g., Cockcroft, 1997).
Consistent georeferencing is an extremely relevant premise of any integration. It must take into account the used coordinate reference systems, both for planar coordinates (X, Y) and for heights, including: datum, projection, coordinate system, accuracy, and measuring systems (precision, accuracy, reliability, etc.). For example, data acquired with smartphones' GNSS sensors or from crowd mapping can have discrepancies with respect to similar data acquired by means of more precise instruments (Dabove & Di Pietra, 2019;Haklay, 2010) that can be relevant depending on use cases.
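Where the coordinate reference systems differ, a coordinate transformation is a typical harmonization step; a minimal sketch with pyproj is given below, assuming a transformation from the Dutch RD New system to WGS 84, with illustrative coordinates.

```python
# Sketch of georeferencing harmonization: transforming planar coordinates
# between reference systems with pyproj (assumed: RD New -> WGS 84).
from pyproj import Transformer

transformer = Transformer.from_crs("EPSG:28992", "EPSG:4326", always_xy=True)
x, y = 92513.0, 437394.0          # illustrative RD New coordinates
lon, lat = transformer.transform(x, y)
print(lon, lat)
```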
Unit of measure deals with making the represented objects homogeneous with respect to the scale or precision (Laurini & Thompson, 1992) with which they are represented. In some cases, on-the-fly transformations (e.g., in GIS) allow correct visualization and, more seldom, analysis of the data. However, in most cases it is necessary to re-scale the data to the same unit of measure.

A.2 | The semantic parameters
Semantics consists of entities, or classes of represented objects, their attributes and the foreseen values of codelists, which are used in the models. Relationships between those are covered in the structural features.
Attributes are the thematic properties of objects. In turn, they can be represented by different terms and definitions. Moreover, in the description and mapping of entities, attributes and codelist values, it is necessary to analyze the features listed in Table 1: used term, vagueness, approximation, semantic paradigm, language and encoding.
In some cases, geometric properties, such as spatial relationships and topology, could also be stored as semantics. It is relevant for integration to assess the mode of storage of these characteristics and consider it according to the data requirements definition. It could in fact be necessary either to remove some relationships explicitly stored as semantics (attributes or relationships), if not necessary in the final data, or to calculate and infer them and store them explicitly if this need is foreseen. One example of this is the grouping of storeys in IFC files, which could be essential for some applications, while not relevant for others (such as the case study considered in this article).
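For instance, the storey grouping stored as semantics in an IFC file can be inspected with ifcopenshell, as in the minimal sketch below; the file name is illustrative.

```python
# Sketch: inspecting the storey grouping stored as semantics in an IFC file
# with ifcopenshell (file name is illustrative).
import ifcopenshell

model = ifcopenshell.open("terraced_tower.ifc")
for storey in model.by_type("IfcBuildingStorey"):
    print(storey.Name, storey.Elevation)
```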
"Term" is the name used to indicate each concept: entity (class), attributes, and codelist values (Kavouras & Kokla, 2007), as defined within the ISO25964 (Dextre Clarke, 2011;Dextre Clarke & Zeng, 2012). "Definitions" help in defining semantics in the least ambiguous way, and sometimes include examples which further clarify the meaning.
These can be compared to each other to support concept mapping and integration (Kavouras & Kokla, 2007).
"Vagueness" (or "semantic accuracy") is described by Kavouras and Kokla (2007) as "the degree of inexactness, fuzziness or indeterminable character of geographic concepts, properties and relationships. Uncertainty, randomness and ignorance contribute to the parameter." Vagueness refers to the inability to clearly understand the meaning of a concept in a context. Examples of meanings that might not always be clear are "large," "high," and "dense." Storing materials in BIM as just "wood" or "glass" can be vague as well for construction-intended purposes.
A different kind of vagueness is related to the source of information. For example, data coming from inferences or enrichment processing of the data will be vaguer than data acquired by survey or authoritative sources.
"Ambiguity", in contrast, is related to the existence of more than one specific meaning, which can be interpreted in different ways. For example, an ifcWall can be either loadbearing or not. There are several ways to understand or specify this, for example, the specific attribute can be used, or it can be assessed based on the disciplinary model being considered (whether structural or architectural, for example). Other examples can be in the interpretation of aerial imagery, whether green roofs are represented, which can be interpreted as grass field, and similar cases. It can be solved by specifying the context. Context (Kavouras & Kokla, 2007) is the restricted conceptual milieu giving meaning to the concept expressed. In the data models, it is usually described in the definition of each term/entity/ class. Constraining the interpretation of each description, likely with examples, is also important to obtain consistent data.
"Approximation" (or "semantic abstraction level") has to do with the granularity of the conceptualization and the level of detail reached by the semantic description.
"Semantic paradigm" is the reference reality and perspective for the conceptualization of the semantic representation of the data (Kavouras & Kokla, 2007;Klein, 2001). Within this feature, we can also include the criteria used to fill the attribute values or methods to be used to calculate them, as well as the unit of measure used.

A.3 | The structural parameters
The structure, or schema, of thematic information is described in the data model or the ontology followed by the data.
Although being slightly different artifacts (Spyns et al., 2002), the principles and features on which the integration of data models or ontologies depends can be considered similar.
Ontology science (e.g., Kavouras & Kokla, 2007;Mostafavi, 2006) provides useful tools and theory with respect to data structure integration. In some cases, the foreseen situations for ontology integration are more complex than what is usually found in data models structuring data from practice (e.g., multiple inheritance is hardly present, or impossible, in data from practice). However, the concepts formulated can be reused to guide the integration of different semantic structures.
The "is-a hierarchies" must be considered, including the distance of a node (a concept) from the root of the ontology or data model and possible multiple inheritance. Similarly, "part-of meronymic hierarchies" are relevant to assess similarity of two structures.
"Granularity," intended as the smallest and biggest objects represented, is the parameter useful to compare different data structures, together with the "paradigm" according to which the reality is interpreted and conceptualized in the data schema. The two parameters must be considered for both the is-a hierarchies and the part-of meronymies.

A.4 | The syntactical parameters
For the syntactic level, the "format" encoding the data is relevant (e.g., GML, JSON, STEP, TIFF, shapefile, ASCII and so on), including the version of the implementation languages used and all the conditions and choices possibly adopted.
National language used is relevant (English, Italian, Dutch, French…), as well as the encoding of the terms and values for entities, attributes and codelists (Klein, 2001). For example, the use of uppercase or lowercase letters and punctuation; storage of dates as dd/mm/yyyy or mm/dd/yyyy; codelists referring to a code (e.g., a number or an alphanumerical string) or containing the value directly, and so on.
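A minimal sketch of this kind of syntactic harmonization is shown below: letter case is normalized and a date is re-encoded from dd/mm/yyyy to ISO 8601; the attribute names and target conventions are illustrative assumptions.

```python
# Sketch of syntactic harmonization of attribute value encodings:
# normalizing letter case and re-encoding dates from dd/mm/yyyy to ISO 8601.
from datetime import datetime

def normalize(record: dict) -> dict:
    return {
        "function": record["FUNCTION"].lower(),
        "construction_date": datetime.strptime(
            record["construction_date"], "%d/%m/%Y"
        ).date().isoformat(),
    }

print(normalize({"FUNCTION": "RESIDENTIAL", "construction_date": "05/11/1998"}))
# {'function': 'residential', 'construction_date': '1998-11-05'}
```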
"Objects' behavior" (in object-oriented spatial databases, operations and axioms that define them) (Egenhofer & Frank, 1992) is also amongst the features relevant for integration, although models from practice hardly reach such complexity.

TABLE A3 (Continued) and TABLE A4 Integrability assessment based on the structural parameters (columns: Destination data (data requirements) | Input data | Assessment)
Use case: Visualization of city objects | Building design | —
Accuracy: 20 cm | 1 cm (design precision is 1 mm; the value is lowered to take into account the possible discrepancy between design and construction) | —
Semantic paradigm: According to the city representation scope | According to the building design and construction scope | 3
Relationships: No relationships represented | — | —
Attribute values, Terms: — | — | —
Vagueness: We cannot know the accuracy of measurement, but the precision is 1 cm | We can measure the maximum height with centimeter accuracy with respect to the designed building; however, this should be measured and checked during the final construction testing and validation | 4
Approximation: — | — | —
Semantic paradigm: According to the city representation scope | According to the building design and construction scope (it implies that, for example, we could even store small details, such as roof furniture and installations) | —