An ontology matching approach for semantic modeling: A case study in smart cities

This paper investigates the semantic modeling of smart cities and proposes two ontology matching frameworks, called Clustering for Ontology Matching-based Instances (COMI) and Pattern mining for Ontology Matching-based Instances (POMI). The goal is to discover relevant knowledge by investigating the correlations among smart-city data based on clustering and pattern mining approaches. The COMI method first groups the highly correlated ontologies of smart-city data into similar clusters using the generic k-means algorithm. The key idea of this method is that it clusters the instances of each ontology and then matches two ontologies by matching their clusters and the corresponding instances within the clusters. The POMI method studies the correlations among the data properties and selects the most relevant properties for the ontology matching process. To demonstrate the usefulness and accuracy of the COMI and POMI frameworks, several experiments were conducted on the DBpedia, Ontology Alignment Evaluation Initiative, and NOAA ontology databases. The results show that COMI and POMI outperform state-of-the-art ontology matching models in terms of computational cost without losing quality during the matching process. Furthermore, these results confirm the ability of COMI and POMI to deal with heterogeneous large-scale data in smart-city environments.


KEYWORDS
clustering, ontology matching, pattern mining, semantic modeling, smart city

INTRODUCTION
Today's World Wide Web has billions of web pages, but the vast majority of them are only human-readable (in HTML format). As a result, machines cannot understand and process this information, and much of the web's potential goes untapped. To address this, researchers created the Semantic Web, where ontologies describe the semantics of data. When data is in the form of ontologies, machines can better understand its semantics and therefore locate and integrate data for a wide variety of tasks. On the Semantic Web, data comes from many different ontologies, and processing information across ontologies is not possible without knowing the semantic links between them. Ontology matching is the process of finding the mappings between two ontologies represented in different domains. It can be applied to several real-world problems, such as biomedical data,1 e-learning,2 and natural language processing.3 Cities are growing rapidly as they strive to accommodate more than 2.5 billion additional citizens by 2050. Understanding city dynamics is crucial to harmonizing conflicting internal demands in housing, business, leisure, mobility, energy, and ecology, as well as to managing external shocks. Heterogeneous data in smart cities is rapidly growing in volume and variety, which makes ontology matching play an important role in smart-city semantic modeling to improve city-planning knowledge.

Motivation
Trivial methods for ontology comparison analyze the ontology instances by considering all the characteristics of both ontologies. Thus, finding the alignment takes n × n′ × m × m′ comparisons, where n and n′ are the numbers of instances, and m and m′ the numbers of data properties, of the first and the second ontology, respectively. Ontology matching is therefore computationally demanding, since many instances and properties must be considered for high-accuracy matching. For instance, for a large-scale dataset such as DBpedia1 with 4,233,000 instances and 2795 distinct properties, about 144 × 10^18 comparisons are needed, which results in a very time-consuming matching process. Figure 1 illustrates this computational complexity on the DBpedia ontology using two well-known ontology matching algorithms: the Extended Inverse Functional Property Suite (EIFPS),4 a semi-supervised learning approach, and the iterative matching framework of Shao et al.,5 which uses a blocking technique to minimize the number of comparisons. With less than 10% of the data properties, the runtime of both models was less than 20 s (results obtained with an Intel i7 processor and 16 GB of main memory); however, with 100% of the data properties, these approaches have runtimes greater than 700 s.

More sophisticated solutions to ontology problems attempt to improve the matching process by exploring the search space with partitioning algorithms,6-10 high-performance computing (HPC),11-13 and evolutionary computation approaches,14-17 among others. However, the overall performance of ontology matching still needs improvement, in particular for complex applications such as those related to smart cities. Data mining aims at discovering relevant information, knowledge, and/or hidden patterns from large databases. Clustering18 and pattern mining19-21 are well-known data mining tasks that partition the whole data into similar groups to study the correlations among the different data features. Clustering and pattern mining have also been applied to ontologies6-8 by considering description logic to decompose an ontology database into several modules that can be used to study the relationships between the relevant concepts of the given ontologies. However, these approaches cannot be straightforwardly applied to the matching problem among different ontologies, since they cannot extract the smallest modules from complex ontologies; moreover, a higher computational cost is required when the data is huge. Motivated by the success of clustering and pattern mining in solving several complicated problems, such as information retrieval,22 traffic transportation,23 and business intelligence,24 this paper presents a data-driven approach and outlines how these powerful data mining techniques can be explored to solve the problem of ontology matching.

Contributions
To the best of our knowledge, this is the first study that explores clustering and pattern mining methods to solve the ontology matching problem. Furthermore, a case study on smart-city semantic modeling is shown to demonstrate an application of this work. The main contributions can be summarized as follows:
1. We present a new framework, called Clustering for Ontology Matching-based Instances (COMI), which adopts clustering techniques to decompose the set of instances of the given ontologies. The framework groups the most relevant features into clusters, which can greatly simplify the matching of different ontologies. To speed up the computation of the ontology matching, an improved k-means algorithm25 is proposed to cluster the instances within the ontologies.
2. We present a new framework, called Pattern mining for Ontology Matching-based Instances (POMI), which adopts pattern mining techniques to study the different correlations among the data properties. The designed framework obtains the most relevant features by applying frequent pattern mining to both ontologies. To speed up the whole ontology matching process, an improved SSFIM algorithm26 with an efficient pruning strategy is proposed for pattern mining over the instances within the ontologies.
3. Extensive experiments were carried out to demonstrate the usefulness of the proposed COMI and POMI frameworks. The results reveal that both COMI and POMI outperform the state-of-the-art ontology matching algorithms in terms of runtime while obtaining high-quality solutions.
4. A case study on smart-city semantic modeling demonstrates the ability of COMI and POMI to deal with big and heterogeneous data in smart-city environments.

Outline
The rest of this paper is structured as follows. Section 2 reviews related work on the ontology matching problem. Section 3.1 gives the formal definitions used in the ontology matching problem. Section 3 presents the COMI framework, whereas Section 4 introduces the POMI framework. A performance evaluation of the COMI and POMI frameworks is provided in Section 5. Finally, Section 6 draws conclusions and outlines future work on the ontology matching problem.

RELATED WORK
Several approaches have been introduced in the last decade to solve the ontology matching problem.14-16,27,28 Matching strategies based on instances are also appropriate for connecting database records.29,30 Much research has explored methods for improving the efficiency of ontology matching. Solutions to the ontology matching problem can be categorized into two groups: (i) solutions based on the reduction of the search space by employing computational intelligence, data mining, and machine learning methods;6-8 and (ii) solutions based on HPC, in which parallel matching is established.11-13 This work focuses on the solutions based on the reduction of the search space, and the approaches in this category are overviewed in the following sections.

Traditional techniques
An instance matching approach named VMI was developed by Wang et al.31 For each instance, it builds two distinct vectors: the name vector and the virtual document vector. The VMI method reduces the number of similarity measurements by using multiple indexing and candidate selection, and it operates effectively only in large cases with a limited number of data properties. The best results are obtained when users specify all the corresponding data properties and the methods of retrieving the values. Thus, their approach is based on a generic instance matching algorithm, whereas some processes are applied to particular domains; that is to say, a simple string comparison of names and data characteristics is used for obtaining comprehensive instance information. In the 2009 OAEI competition, VMI matched small ontology datasets successfully; however, its quality decreases as the number of instances increases. Li et al.32 developed an approach based on the hypothesis that two entities of the same real-world object may be matched when they are related to previously matched entities. This technique incorporates multiple lexical matches using a new voting aggregation process and only uses the structural information and the observed correspondences to locate additional information. It can primarily be broken down into two stages:
1. Identification of highly accurate seminal correspondences from lexical information.
2. Derivation of additional matching outcomes based on the semantic matching of the previous stage, with a structural matching strategy.
Based on the findings of the 2010 OAEI study, this method obtains reasonable accuracy on certain medium and small ontology databases. Hu et al.5 presented RiMOM at the OAEI competitions in 2013 and 2016. It introduces an iterative matching framework in which the distinctive information is centered on a blocking technique for minimizing the number of candidate pairs. It uses predicates and their distinctive objects as keys to the index of the instances. Moreover, a weighted exponential similarity averaging method is used to ensure that the instance matching achieves high precision. The new blocking approach decreases the computational cost significantly without losing precision and recall; RiMOM achieves 99% accuracy on small and medium ontology datasets. Alam et al.33 developed an extension of MERGILO, a method to reconcile knowledge graphs extracted from text by graph alignment and word similarity. Compared with generic approaches, the results of the extended MERGILO show significant improvement. Rosaci34 found that ontology matching can be used to link various smart agents: the ontology of an agent models the agent's actions, and, through the proposed matching, any agent in the group knows the relation between itself and another agent. Rosaci35 then used a hierarchical model to identify semantic associations between web data; the semantic connections represented by metadata are discussed in the context of a collection of network entities, and the usefulness of this approach has been demonstrated in well-known web user recommendation systems. The interlinking issue was first addressed as a duplication or record-linkage problem by the database community, where Elmagarmid et al.36 surveyed several methods to tackle the heterogeneity problems of ontology matching and proposed a method of handling sets in organized property-segmented documents.

Data mining-driven solutions
Linked open data (LOD) is data that is structured and interlinked so that it becomes more useful through semantic queries. To address the matching problem in LOD using rules taken from association rule mining, Niu et al.4 developed the EIFPS technique, which is considered a semi-supervised learning approach. A limited number of existing owl:sameAs matches are used as seeds, and the related rules serve as criteria for optimizing precision. The authors presented a graphical metric that measures likelihood and applies Dempster's rule while integrating confidence values. The theory makes it possible to combine instances from different datasets and to arrive at a degree of belief that takes into account all the available instances. The degrees of belief may or may not have the mathematical properties of probabilities; they differ depending on the degree of correlation between the two datasets. Then, exploiting the power of resource homogeneity in the e-learning context, Sergio et al.2 presented the LOM framework. To semantically expand and improve the available tools for online learning, an initial associative classifier for ontology matching was developed and investigated. This model uses a feature-based similarity function that needs historical knowledge as a training set. The method was evaluated and verified on the 2014 OAEI ontology database competition.
The results for several larger ontology databases showed 90% precision. Ochieng et al.37 presented an approach that splits an ontological graph into many partitions. Cluster-based similarity aggregation (CSA)38 is a system integrating varied factors (i.e., five measures, a string-similarity calculation, and a WordNet-based similarity measure) to derive the alignment of ontology concepts. Algergawy et al.39 then proposed a clustering approach for large-scale ontology matching. The main concept is to divide the schema graph into clusters by using context-driven structural node similarities. After the partitioning of each ontology, a vector space model is defined to discover similar clusters and generate the common concepts. In the context of smart-city semantic modeling, several ontology-matching-based solutions have been proposed. Bellini et al.40 introduced a system for the management of large-volume data from a range of sources that considers both static and dynamic data in smart cities. Qui et al.41 developed a semantic graph-based method incorporating semantic graph structure and context information that can be used to identify nontaxonomic relationships in smart-city environments. A unified, consolidated, and live view of heterogeneous city data sources was given by Le et al.,42 which processes billions of historical and current records to accumulate and enrich millions of triples for linking to a graph in real time every hour. Qui et al.43 proposed a graph method for accurately semanticizing knowledge from heterogeneous information on smart cities: smart-city data are first compared using word co-occurrence to obtain similarities; a semantic graph is then constructed based on the similarities between the smart-city data; and a community detection algorithm is finally used to divide the smart-city data into different communities, where each community acts as a concept.

Tools
Several review and analytics works that study ontology matching solutions are discussed here. Analyzing the state-of-the-art matching issues, Shvaiko et al.27 evaluated the solutions to the matching problem; assessments and application analyses were provided using the competitive OAEI ontology database competitions. Abubakar et al.29 studied the current ontological situation, with specific consideration of instance-based ontology matching rather than the popular conceptual matching. To estimate relative effectiveness and performance, Nentwig et al.30 then investigated comparative evaluations of link discovery (LD) frameworks. Mohammadi et al.44 presented statistical methods to compare two or more alignment systems in terms of efficiency; the statistical procedures were then discussed45 to show comparisons between two alignment systems. The database community first considered interlinking as a duplication or record-linkage problem. Elmagarmid et al.36 aimed various techniques at resolving the heterogeneity issues of ontology matching and proposed a solution based on a series of structured, property-segmented records. The classification of the ontology-based models was also incorporated into methods based on character-based similarity metrics, phonetic similarity metrics, token-based similarity metrics, and numeric similarity metrics. Several methods and tools for detecting duplicated records have been developed. Otero et al.28 addressed a variety of approaches and their functional real-life applications, involving more than 50 ontology matching systems. Heflin et al.46 gave an overview of the ontology relationships of ontology instances. They also summarized some instance matching algorithms, such as scalable entity co-reference systems and manual and automated blocking key selection, and introduced generic algorithms that use logical reasoning based on string matching. Moreover, two extensive evaluations of ontology matching systems were made: (1) ASMOV,47 N2R,48 RiMOM,49 CODI,50 PARIS,51 EPWNG,52 SiGMa,53 and MA54 were evaluated and verified on the OAEI (Person1, Person2, and Restaurant) benchmark; and (2) EdJoin,55 DisNGram,56 PPJoin+,57 and FastJoin58 were compared on the large-scale databases RKB and SWAT. Table 1 illustrates the benefits and drawbacks of the current ontology matching approaches.

Discussion
Current ontology matching works obtain good results, in terms of runtime and solution quality, on small-scale databases (i.e., with small or medium numbers of instances) and lower-dimensional data (instances with a small or medium number of data properties). However, the current approaches have several limitations, two key ones being the inability to deal with large-scale data and with high-dimensional data. In this work, we present two data-mining-based frameworks, exploring clustering and pattern mining for ontology matching, to address both of these limitations.

TABLE 1 Benefits and drawbacks of the current ontology matching approaches

Strategy    | Models and algorithms                                 | Limitations
Traditional | VMI,31 RiMOM5                                         | Unable to deal with large-scale data.
Traditional | MERGILO,33 Li et al.,32 CILIOS34                      | Matching based on prior results, which decreases the overall accuracy performance.
Traditional | Rosaci,35 Elmagarmid et al.36                         | Unable to deal with a high number of data properties.
Data mining | EIFPS,4 LOM2                                          | Use an old matching mechanism.
Data mining | CSA,37 Algergawy et al.,39 Xue et al.,59 Xue et al.60 | High time consumption due to (1) the similarity graph mechanism and (2) the combination of different measures.
The goal of ontology matching by instances is to find the common properties among ontologies, that is, to determine the alignment function $\mathcal{A}$ such that

$$\mathcal{A}(\mathcal{O}_1, \mathcal{O}_2) = \bigcup \left\{ (i, i') \in \mathcal{I}_1 \times \mathcal{I}_2 \mid i \simeq i' \right\}. \qquad (1)$$

Equation (1) refers to the union of all the common instances between two ontologies, where two instances are similar if they agree on a set of data properties (see Equation (2)):

$$i \simeq i' \iff \forall p \in P \subseteq \mathcal{P}_1 \cap \mathcal{P}_2 : v(i, p) = v(i', p). \qquad (2)$$

FIGURE 2 Ontology matching-based instance
The naive approach to the ontology matching problem is to scan all the values of the instances among the ontologies and make comparisons. The matching process determines the outcome of the alignment, and each matching may lead to different alignment instances. Each alignment result is then evaluated and compared to the reference alignment, which is an alignment proposed by a user or an expert in the particular domain and includes all the common ontology instances.
For instance, Figure 2 presents a simple example of ontology matching by instances. Consider the two ontologies of the running example, O_1 and O_2. The first step extracts the sets of instances I_1 (of size m_1) and I_2 (of size m_2) and groups them into several subsets. The matching process is then performed to derive an alignment among the ontologies. The reference alignment represents the set of common instances among the two ontologies; thus, the optimal matching between O_1 and O_2 is, for example, i_1 = i′_12, i_3 = i′_15, and i_10 = i′_26. In the ontology matching problem by instances, the most important issue is to find the maximum number of real-world correspondences across two large-scale ontologies. Consider m_1 and m_2 as the numbers of instances of the two ontologies. If the number of instances is very large, for example, more than 10 million, then the matching requires a high computational cost (e.g., the GeoNames3 dataset contains more than 10 million geographical names). To handle large-scale ontology data, we present a clustering-based method that finds highly correlated subsets for ontology matching by instances.
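To make the baseline concrete, the following minimal Java sketch shows the naive scan described above. The Instance record, the equality-based matches test, and all identifiers are illustrative simplifications for exposition, not the data structures or similarity measures used in the paper.

import java.util.*;

// A minimal sketch of naive ontology matching by instances: every instance of O1
// is compared against every instance of O2 over all property values, which is the
// n x n' x m x m' scan discussed above. Names and structures here are illustrative.
public class NaiveMatcher {

    // An instance is reduced to a map from property name to value.
    record Instance(String id, Map<String, String> properties) {}

    // Two instances "match" here when they agree on every shared property;
    // real systems use string similarity and weighting instead of equality.
    static boolean matches(Instance a, Instance b) {
        for (var e : a.properties().entrySet()) {
            String v = b.properties().get(e.getKey());
            if (v != null && !v.equals(e.getValue())) return false;
        }
        return true;
    }

    // The alignment is the set of id pairs that match: O(n * n' * m) here,
    // and O(n * n' * m * m') if property names must also be aligned.
    static List<String[]> align(List<Instance> o1, List<Instance> o2) {
        List<String[]> alignment = new ArrayList<>();
        for (Instance a : o1)
            for (Instance b : o2)
                if (matches(a, b)) alignment.add(new String[]{a.id(), b.id()});
        return alignment;
    }
}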

Principle
The aim of COMI is to split the whole set of instances of each ontology into several dependent clusters, where each cluster contains highly correlated instances to be processed later. Next, as illustrated in Figure 3, COMI explores the instances of the clusters to find the common features. It mainly includes a clustering process and a matching process. In the clustering process, the instance set is divided into several collections of subinstances (clusters) using data mining techniques. This step is considered preprocessing: the set of instances is grouped into clusters, each with a small number of instances, and the instances of a cluster share the maximum number of common properties, so they are highly correlated. During the matching process, COMI explores the instances of the clusters to find the alignments: instead of performing the alignment operation between the instances of the ontologies one by one, the alignment is established between the instances of the two ontologies via their representative clusters.

Algorithm 1. COMI: Clustering for Ontology Matching-based Instances
Input: I_i = {i_1^i, …, i_n^i}: the set of n instances of the ontology O_i.
Output: A: alignment set.
    ********** centroid initialization **********
    …
    return C_i
    ********** matching process **********
    list ← ∅
    for p = 1 to k_i do    ⊳ Finding the similar clusters
    …

Algorithm 1 presents the COMI pseudo-code. The set of instances I is the input, and the best alignment A is the output. The set of clusters is denoted by C, and the set of centroids by g. The first step randomly initializes the centroids using the function InitializeCenters(). The first loop (lines 6 to 17) scans the whole set of instances I. The function d_instance(e, g_1) computes the distance between an instance e and the first centroid g_1; for example, if e = {(Name, Joe), (age, 26), (type, man)} and the centroid is g_1 = (26, man, USA), then d_instance(e, g_1) is the size of the intersection of their values, here 2. The loop from lines 9 to 13 finds the smallest distance between the instance e and all the centroids in g and keeps the corresponding rank r. Line 16 assigns the instance e to the list of the cluster r, which yields the minimum distance, using the function AddElement(). In lines 18 to 24, the centers are updated and kept in the set g′; if g_new is equal to the previous center in g, the clustering process terminates, otherwise the same process is repeated until g_new and g become the same. The final clustering results are kept in a matrix structure M, where each element M[i][j] is the distance between the centroid g_j and the ith instance of the jth cluster (lines 25-29). In lines 35 to 45, the algorithm scans the sets of centroids G_i and G_j of the two ontologies O_i and O_j and determines the minimum distance between two centroids with the function d_centroid; the two clusters with the minimum distance are added to the list of aligned clusters, list, using the function AddClusters(). In lines 48 to 58, the algorithm scans the whole set of instances of the two aligned clusters: with p and q the two selected clusters, the loop from lines 50 to 56 scans all the pairs of instances e_1 and e_2 of the clusters p and q, and the minimum distance is computed using the formula d_instance.
For each pair of aligned clusters p and q in list, the alignment results are added to the set A and denoted A_{p,q}. This process is repeated for all the cluster pairs in list. Next, the decomposition and matching steps are described in detail.

Decomposition
The ontology matching problem usually deals with a large number of instances, which is a nontrivial task, especially when the ontology is large scale. Thus, it is necessary to decompose the huge data into a small number of clusters, which reduces the difficulty of the matching process (Algorithm 1). In this section, we investigate the partitioning-based approach and utilize the k-means25 algorithm for the matching problem. The distance and the centroid computation are defined below.
Definition 2 (distance between instances). We denote by $p^{i}_{jl}$ the value of the property $P^{i}_{j}$ in the instance $I^{i}_{l}$ of the ontology $O_i$. The distance $d_{instance}$ between two instances $I^{i}_{l_1}$ and $I^{i}_{l_2}$ is then defined as

$$d_{instance}\left(I^{i}_{l_1}, I^{i}_{l_2}\right) = \left|\left\{ j \mid p^{i}_{jl_1} = p^{i}_{jl_2} \right\}\right|,$$

that is, the number of property values shared by the two instances, as in the worked example of Algorithm 1.

To compute the centroids, we consider the set of instances of a cluster $G_s$. The aim is to find a gravity center of this set that is itself an instance. Inspired by the centroid formula developed in prior work,61 we compute the centroid $g_s$ as follows. The frequency of each value is calculated over all the instances of the cluster $G_s$. The values of the instances in $G_s$ are sorted according to their frequency, and only the $n_i$ most frequent values are assigned to $g_s$, that is, $g_s = \{ j \mid j \in F_{n_i} \}$, where $F_{n_i}$ denotes the set of the $n_i$ most frequent items of the cluster $G_s$.

k-means is a well-known partitioning-based clustering algorithm. It defines k clusters and divides the set of instances of each ontology into k subsets by considering the correlation between the instances of the same cluster. The k-means process starts by initializing k clusters, for which k instances of the given ontology can be randomly selected. Then, it scans each instance of the whole set, calculates the distance between this instance and all the centroids, and assigns it to the cluster with the nearest centroid. After all the instances are examined, the centroid of each cluster is updated. This process is repeated until the cluster centroids become stable.
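The following Java sketch illustrates the decomposition step under these definitions, assuming an instance is reduced to its set of property values and reading Definition 2's measure as value overlap. All class and method names are ours, for illustration only, not the paper's implementation.

import java.util.*;
import java.util.stream.*;

// Sketch of the COMI decomposition step: k-means over instances, where
// closeness follows Definition 2 (number of shared property values) and a
// centroid keeps the ni most frequent values of its cluster.
public class ComiDecomposer {

    // Overlap between an instance and a centroid: |values in common|.
    static int overlap(Set<String> instance, Set<String> centroid) {
        Set<String> tmp = new HashSet<>(instance);
        tmp.retainAll(centroid);
        return tmp.size();
    }

    // Centroid of a cluster: the ni most frequent values among its instances.
    static Set<String> centroid(List<Set<String>> cluster, int ni) {
        Map<String, Long> freq = cluster.stream()
                .flatMap(Set::stream)
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()));
        return freq.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(ni)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    // One k-means assignment pass: each instance goes to the centroid it
    // overlaps most; centroids are then recomputed until they stop changing.
    static List<List<Set<String>>> assign(List<Set<String>> instances,
                                          List<Set<String>> centroids) {
        List<List<Set<String>>> clusters = new ArrayList<>();
        for (int i = 0; i < centroids.size(); i++) clusters.add(new ArrayList<>());
        for (Set<String> inst : instances) {
            int best = 0;
            for (int c = 1; c < centroids.size(); c++)
                if (overlap(inst, centroids.get(c)) > overlap(inst, centroids.get(best)))
                    best = c;
            clusters.get(best).add(inst);
        }
        return clusters;
    }
}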

Matching process
This step benefits from the clustering step by defining a new matching strategy instead of computing the similarity between every pair of instances of the given ontologies. The similarity measures between the centroids of the clusters and the instances are determined. Two distances are defined: the first determines the similarity between two centroids in different ontologies, while the second represents the distance between two instances in different ontologies (Algorithm 1). The principal idea of the matching process is to find two highly correlated clusters among the ontologies by considering the minimum distance between them. After that, the instances within the matched clusters are checked in order to find the matching instances. Consider $g^{i}$ and $g^{j}$ as two centroids of the input ontologies.
Definition 3 (distance between centroids). Let g and g′ be two centroids of two different ontologies. The distance $d_{centroid}$ between the two centroids g and g′ is defined as

$$d_{centroid}(g, g') = 1 - \frac{2 \times |g \cap g'|}{|g| + |g'|},$$

where |g|, |g′|, and |g ∩ g′| are the numbers of properties of the centroids g and g′ and of their intersection, respectively.

Definition 4 (matching instances). We define the distance $d_{matching}$ between two instances $I^{i}_{l_1}$ and $I^{j}_{l_2}$ as the sum of the distances between each instance and its centroid and the distance between the two centroids of these instances:

$$d_{matching}\left(I^{i}_{l_1}, I^{j}_{l_2}\right) = d_2\left(I^{i}_{l_1}, g^{i}\right) + d_2\left(I^{j}_{l_2}, g^{j}\right) + d_1\left(g^{i}, g^{j}\right),$$

where $d_1$ is $d_{centroid}$ and $d_2$ is $d_{instance}$.
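A minimal Java sketch of Definitions 3 and 4 follows. The Dice-style centroid distance is one plausible reading based on |g|, |g′|, and |g ∩ g′|, not necessarily the paper's exact formula, and the same value-set form stands in for the instance-to-centroid terms; identifiers are illustrative.

import java.util.*;

// Sketch of the cluster-level matching step (Definitions 3 and 4): clusters of
// the two ontologies are paired by centroid distance, and instances are then
// compared only within paired clusters.
public class ComiMatcher {

    // Definition 3 (assumed form): 1 - 2|g ∩ g'| / (|g| + |g'|).
    static double centroidDistance(Set<String> g, Set<String> gPrime) {
        Set<String> inter = new HashSet<>(g);
        inter.retainAll(gPrime);
        return 1.0 - (2.0 * inter.size()) / (g.size() + gPrime.size());
    }

    // Definition 4: instance-to-centroid distances plus the centroid distance.
    static double matchingDistance(Set<String> i1, Set<String> g1,
                                   Set<String> i2, Set<String> g2) {
        return centroidDistance(i1, g1)   // d2(i1, g1), value-set form
             + centroidDistance(i2, g2)   // d2(i2, g2)
             + centroidDistance(g1, g2);  // d1(g1, g2)
    }
}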
The complexity of COMI depends on the number of instances n, the number of properties m, the number of clusters k, and the number of matchings r. The decomposition step needs $O(n \times m \times k)$. This process is performed only once for each ontology, whatever the number of matchings. Only similar clusters are used during the matching process, which requires $O\left(\frac{n \times m}{k}\right)$. The total cost of COMI to perform r matchings is $O\left(n \times m \times k + r \times \frac{n \times m}{k}\right)$, which is significantly lower than the baseline solutions that require $O(n \times m \times r)$.
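For intuition, plugging illustrative numbers (assumed here, not taken from the experiments) into these bounds, with $n = 10^{6}$ instances, $m = 10^{3}$ properties, $k = 10$ clusters, and $r = 10^{2}$ matchings:

$$n m r = 10^{11}, \qquad n m k + r \cdot \frac{n m}{k} = 10^{10} + 10^{2} \cdot 10^{8} = 2 \times 10^{10},$$

that is, roughly a fivefold reduction over the baseline in this setting, and the gap widens as r grows because the decomposition cost is paid only once.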

Principle
POMI, as shown in Figure 4, investigates the correlation between the data properties of the ontologies to obtain the best characteristics for the matching process. It extracts, through a frequent itemset mining (FIM) process,62 the most relevant data properties that cover as many instances as possible. FIM refers to the extraction, from a transaction database, of the relevant itemsets that satisfy a minimum support threshold (minsup). In the designed three-phase model (mining, pruning, and selection), we follow a classical pattern mining method to efficiently discover the best features of the ontologies. The pruning process is the significant difference between previous mining strategies and our pattern-mining-based model: existing strategies list all the patterns that exceed the minimum support constraint, while our approach considers another measure by discovering a subset of relevant patterns that cover a maximum of transactions in the database (i.e., the instances in this study). The algorithm is presented in the pseudo-code given in Algorithm 2. The mining step is performed in lines 4 to 18, the pruning strategy in lines 21 to 31, and the selection and matching processes in lines 33 to 56.

Pattern discovery
In the pattern-mining field, the fundamental algorithms, such as Apriori,62 DIC,63 and FP-Growth,64 require a huge amount of time and memory to discover the set of frequent itemsets for a predefined minimum support threshold. SSFIM26 was recently presented to discover frequent itemsets within a single pass, and it is insensitive to the minimum support threshold. Experimental results showed that SSFIM performs better than the state-of-the-art pattern mining algorithms. Thus, in this study, SSFIM is used in the designed model to discover the frequent literals (labeled S) from the set of instances I. SSFIM proceeds in two main steps: generation and extraction. In the generation step, beginning with I_1, Pattern(I_1) denotes all possible literal combinations of this instance; each pattern of Pattern(I_1) is inserted into the hash table H with an initial frequency of 1. Then, each pattern of Pattern(I_2) is generated from the second instance: if the pattern already exists in H, its frequency is increased by one; otherwise, a new entry with frequency 1 is created. This is repeated until all the instances of I are processed. The second step extracts the frequent patterns (i.e., the frequent literals in this study) from the hash table H. The support of each pattern t is determined (see Equation (6)); if the frequency of t is no less than minsup, then t is considered a frequent literal and is added to the set S of frequent literals.

FIGURE 4 POMI: Pattern mining for ontology matching-based instances
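The following Java sketch illustrates the single-pass counting idea described above. It is a simplified toy version: full subset enumeration is exponential in the instance size, so real implementations such as SSFIM bound or structure the generation; all identifiers are ours.

import java.util.*;

// Sketch of single-pass frequent-literal discovery: enumerate the literal
// combinations of each instance once, count them in a hash table, then keep
// those whose frequency reaches minsup.
public class SinglePassMiner {

    // All non-empty subsets (patterns) of one instance's literals.
    static List<Set<String>> patterns(List<String> literals) {
        List<Set<String>> out = new ArrayList<>();
        int n = literals.size();
        for (int mask = 1; mask < (1 << n); mask++) {
            Set<String> s = new TreeSet<>();
            for (int b = 0; b < n; b++)
                if ((mask & (1 << b)) != 0) s.add(literals.get(b));
            out.add(s);
        }
        return out;
    }

    // One pass over the instances (generation), then a frequency filter (extraction).
    static Set<Set<String>> frequentLiterals(List<List<String>> instances, int minsup) {
        Map<Set<String>, Integer> h = new HashMap<>();
        for (List<String> inst : instances)
            for (Set<String> p : patterns(inst))
                h.merge(p, 1, Integer::sum);   // insert with 1 or increment
        Set<Set<String>> frequent = new HashSet<>();
        for (var e : h.entrySet())
            if (e.getValue() >= minsup) frequent.add(e.getKey());
        return frequent;
    }
}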

Pruning
The limitation of generic pattern mining is that a large number of frequent patterns is discovered, which is inefficient when many ontologies must be handled; analyzing a huge number of discovered patterns is a time-consuming and nontrivial task. To overcome this limitation, a new strategy is presented to filter the mined frequent patterns so that a small number of meaningful and significant patterns is kept to explain and illustrate the ontology database well. Here, we use a novel idea, called coverage, in the designed pruning strategy: based on the Minimum Description Length principle,65 fewer and more representative patterns are kept that cover the largest number of instances of an ontology (Algorithm 2), so the number of frequent patterns can be significantly reduced. The patterns discovered by the developed model differ from the maximal66 and closed67 frequent patterns.

Algorithm 2. POMI: Pattern mining for Ontology Matching-based Instances
3:  ********** mining step **********
4:  for each instance e ∈ I_i do    ⊳ Extract the frequent itemsets using only a single pass.
5:      F_e^i ← Itemset(e)
6:      for each element i ∈ F_e^i do
7:          if i ∈ H_i then
8:              increment the frequency of i in H_i
            …
19: end for
20: ********** pruning step **********
21: sol ← InitialSol(S_i)
22: S_{i,*} ← S_i
23: iter ← 0
24: while Pruning_max(S_i) < m and iter < IMAX do    ⊳ Select the smallest itemsets that cover the largest number of instances.
25:     neighbors ← ComputeNeighbors(sol)
26:     best ← BestNeighbors(neighbors)
27:     if Pruning_max(best) > Pruning_max(S_{i,*}) then
28:         S_{i,*} ← best
29:     end if
30:     iter ← iter + 1
31: end while
32: ********** selection step **********
33: SP ← ∅
34: for each property p ∈ P_i do
35:     if Probability(p, S_i) > γ then    ⊳ A threshold is used to select the appropriate data properties.
36:         SP ← SP ∪ {p}
37:     end if
38: end for
39: ********** matching step **********
40: for each instance j ∈ I_i do
41:     P_i ← SetProperties(j)
42:     for each instance l ∈ I_j do
43:         P_j ← SetProperties(l)
44:         L ← ∅
45:         for each property p ∈ P_i do
46:             for each property p′ ∈ P_j do
47:                 if Value(p, p′) then    ⊳ Compare the two instances by considering the selected sets of properties, <P_i, I_i> for the ith ontology and <P_j, I_j> for the jth ontology.
48:                     L ← L ∪ {p, p′}
49:                 end if
50:             end for
51:         end for
52:         if L ≠ ∅ then
53:             A ← A ∪ ({ID_i, ID_j} ∪ L)
54:         end if
55:     end for
56: end for
57: return A

More detailed explanations of the proposed solutions are given below.

Definition 5. Let S = {S_1, S_2, …, S_r} be the set of frequent patterns discovered in the mining step. The coverage pruning problem is to find a subset of S that maximizes the function $\text{Pruning}_{max}$.

Definition 6. $\text{Pruning}_{max}$ is the function that measures how many records of the given ontology database are covered. Let $R(S_i)$ denote the set of instances covered by a pattern $S_i$. The coverage pruning function returns a subset $S' \subset S$ that maximizes the coverage value, defined as

$$\text{Pruning}_{max}(S') = \left| \bigcup_{S_i \in S'} R(S_i) \right|.$$

Definition 7. The minimum subset $S^* \subset S$ is an optimal solution to the coverage pruning problem in an ontology that includes m instances: $S^*$ covers all the records, and for every subset $S' \subseteq S$,

$$\text{Pruning}_{max}(S') = \text{Pruning}_{max}(S^*) \Rightarrow |S'| \geq |S^*|.$$
As the subset of frequent patterns can be chosen from the $2^r$ possible subsets of S, finding the optimal subset that meets the coverage pruning constraints is an NP-complete problem, and an exhaustive search would be extremely time-consuming or even impractical if the cardinality of S is large. To tackle this problem, a greedy search can be combined with a neighborhood search to reduce the search space and provide a reasonable solution rather than a globally optimal one. We were inspired by the work of Hosseini et al.,68 where a greedy algorithm is used to explore the search tree and local searches are performed on each generated node. The set of frequent patterns S, a maximum number of iterations, and the number of instances in the given ontology are taken as input, and the output is the set of patterns S*. The first solution is created by randomly selecting frequent patterns from S; it is placed in the variable S*, which holds the best solution found so far. An iterative process is then performed to improve the current solution. This process is repeated while the patterns in S* cover fewer than m instances and the iteration count is below the maximum number of iterations. To improve the current solution, its neighborhood is determined: all the solutions that can be obtained by adding another frequent pattern to the current solution are produced. The best of these solutions is denoted best, and if it is better than the best solution S* at the current stage according to the pruning function, then S* is set to best. It should be noted that if two solutions sol_1 and sol_2 satisfy Pruning_max(sol_1) ≥ Pruning_max(sol_2) and |sol_1| ≤ |sol_2|, then sol_1 is considered a better solution than sol_2, because the number of patterns should be minimized. A greedy model is thus used to obtain the smallest set of frequent patterns that maximizes the number of instances covered by the patterns. Note that other pruning functions could be used for other requirements.
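A minimal Java sketch of this coverage-based pruning follows, written as the classic greedy set-cover approximation under the assumptions above: the coverage sets R(S_i) are given as input, and all identifiers are illustrative rather than the paper's.

import java.util.*;

// Sketch of POMI's coverage pruning: greedily keep the pattern covering the
// most still-uncovered instances until all m instances are covered or the
// iteration budget runs out (greedy set-cover approximation).
public class CoveragePruner {

    static List<String> prune(Map<String, Set<Integer>> coverage, // pattern -> R(Si)
                              int m, int maxIter) {
        List<String> kept = new ArrayList<>();
        Set<Integer> covered = new HashSet<>();
        for (int iter = 0; iter < maxIter && covered.size() < m; iter++) {
            String best = null;
            int bestGain = 0;
            for (var e : coverage.entrySet()) {
                Set<Integer> gain = new HashSet<>(e.getValue());
                gain.removeAll(covered);            // instances not yet covered
                if (gain.size() > bestGain) { bestGain = gain.size(); best = e.getKey(); }
            }
            if (best == null) break;                // no pattern improves coverage
            kept.add(best);
            covered.addAll(coverage.get(best));
            coverage.remove(best);
        }
        return kept;   // a small pattern set approximating the optimal S*
    }
}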

Selection
The set of selected properties SP is determined from the pruning strategy and the mined frequent literals S. Let P(i, S) denote the probability of the ith property appearing in the set S of frequent literals. A threshold γ in the range [0, 1] is used to select the data properties: if the probability value of a property is higher than γ, then it is added to the set SP (Algorithm 2).

Definition 8. Consider a data property p and the set S of frequent literals discovered by the pruning step. The property p is retained for the matching process if it satisfies the condition

$$P(p, S) > \gamma,$$

where P(p, S) is the probability of the property p appearing in the frequent literals S and γ is the interestingness degree threshold.
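A short Java sketch of this selection rule follows; the encoding of literals as "property=value" strings and the way P(p, S) is estimated are assumptions made for illustration.

import java.util.*;

// Sketch of the selection step (Definition 8): keep a data property when the
// fraction of frequent literals mentioning it exceeds the threshold gamma.
public class PropertySelector {
    static List<String> select(List<String> properties,
                               List<Set<String>> frequentLiterals,
                               double gamma) {
        List<String> selected = new ArrayList<>();
        if (frequentLiterals.isEmpty()) return selected;
        for (String p : properties) {
            // Assumed encoding: a literal "age=26" belongs to property "age".
            long hits = frequentLiterals.stream()
                    .filter(s -> s.stream().anyMatch(lit -> lit.startsWith(p + "=")))
                    .count();
            if ((double) hits / frequentLiterals.size() > gamma) selected.add(p);
        }
        return selected;   // the set SP used by the matching step
    }
}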

Matching process
The instances of the fundamental ontology are compared to the instances of the second ontology after the selection of the relevant data properties. In this step, the fundamental ontology BO is matched with the second ontology O. Moreover, <P, I> denotes the set of data properties P and the set of instances I of the fundamental ontology, and <P′, I′> denotes the set of data properties P′ and the set of instances I′ of the second ontology. Here, P and P′ are the two sets obtained from the described feature selection models. For iterative matching, the entire set of instances I of the fundamental ontology is compared to the set of instances I′ of the second ontology: two instances are compared by checking each value of the ith instance of BO against all the values of the jth instance of O. The complexity of POMI depends on the number of instances n, the number of properties m, the number of selected properties m′, and the number of matchings r. The pattern mining step needs $O(n \times m)$ and is performed only once for each ontology, whatever the number of matchings. During the matching process, only the selected properties are used, and it should be noted that $m' \ll m$; this requires $O(n \times m')$. The total cost of POMI to perform r matchings is $O(n \times m + r \times n \times m')$, which is significantly lower than the baseline solutions, which require $O(n \times m \times r)$.

PERFORMANCE EVALUATION
Extensive experiments were conducted on well-known ontology databases to validate the usefulness of the proposed COMI and POMI frameworks. The experiments were carried out on a desktop with an Intel i7 processor and 16 GB of main memory, and all the algorithms were implemented in Java. The experiments employed three well-known ontology databases that are often used in the ontology matching community (in each experiment, all systems are assigned the same dataset). Details are given below.
1. DBpedia4 is a shallow, cross-domain ontology that was created manually based on Wikipedia. It extracts structured content from the information created in Wikipedia, and this structured information is available on the World Wide Web. The ontology currently covers 2795 data properties and 4,233,000 instances.
2. The characteristics (i.e., numbers of instances and data properties) of the Ontology Alignment Evaluation Initiative (OAEI)5 databases are shown in Table 2. OAEI is an international initiative that arose from the increasing number of methods available for matching ontologies and the need to evaluate them. Among the objectives of OAEI are to assess the strengths and weaknesses of alignment systems, to compare the performance of techniques, and to improve evaluation techniques in order to help improve the work on matching ontologies.
3. The Smart City Use case6 contains more than 400,000 sensing objects allocated around the world, with varied aspects of data distribution, and more than 8.5 billion sensor records.

Performance on DBpedia
Two baseline algorithms, EIFPS4 and RiMOM,5 were considered in this experiment. The quality of the matching process was evaluated using the F-measure, which combines the output A of the matching process and a reference alignment R as

$$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}, \qquad \text{precision} = \frac{|R \cap A|}{|A|}, \qquad \text{recall} = \frac{|R \cap A|}{|R|}.$$

It should be noted that the ground truth, represented by the best alignment, was annotated by domain experts, which is a manual procedure.
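As a worked example with assumed numbers (not taken from the experiments): if |R| = 100, |A| = 90, and |R ∩ A| = 81, then

$$\text{precision} = \frac{81}{90} = 0.90, \qquad \text{recall} = \frac{81}{100} = 0.81, \qquad F = \frac{2 \times 0.90 \times 0.81}{0.90 + 0.81} \approx 0.85.$$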

Runtime performance

The first set of experiments compared the runtime of COMI with state-of-the-art approaches under varied numbers of clusters. COMI|X| denotes the COMI approach with |X| clusters. The runtime reported in this experiment is the runtime of the whole COMI process, including the decomposition and matching steps. Figure 5 shows the runtime of the five approaches (COMI2, COMI5, COMI10, EIFPS, and RiMOM), where the percentage of instances varied from 25% to 100%. When the number of matchings increased from 1000 to 100,000, COMI outperformed the two other approaches. Moreover, the runtime of COMI remained stable, while the baseline approaches required additional computing time for a large number of instances and many matchings: the two compared approaches (EIFPS and RiMOM) needed more than 600 s to handle the 100,000 matchings on the whole DBpedia ontology database, whereas the designed COMI10 (COMI with 10 clusters) required only 54 s. These results are explained by the fact that our approach considers only highly correlated instances in the matching process, using an efficient strategy to explore the information provided in each cluster of instances. The results also show that increasing the number of clusters from 2 to 10 made only a slight difference in execution time, as the clustering process is adopted only in the preprocessing step.

Solution quality
A second set of experiments compared the quality of the solutions of COMI with the state-of-the-art EIFPS and RiMOM algorithms on the DBpedia ontology database. Figure 6 shows the results of the five approaches (COMI2, COMI5, COMI10, EIFPS, and RiMOM), where the percentages of the instances and the properties varied from 25% to 100%. The results reveal that COMI10, EIFPS, and RiMOM reached a similar quality, while COMI5 and COMI2 provided lower quality; thus, generating more clusters allows the designed COMI to achieve better results, for example, 10 clusters for the DBpedia data. Moreover, COMI10 outperformed the EIFPS and RiMOM algorithms on large and high-dimensional ontology data: when the percentage of properties and instances was set to 25%, the F-measures of EIFPS and RiMOM were 81% and 82%, respectively, while COMI10 did not reach 80%; however, on 100% of the data, the F-measure of COMI was 93%, while the F-measure of the two other approaches was around 60%. We explain this by the fact that the clustering quality with k = 10 was better than with k = 2 and k = 5: with k = 10, more similar clusters sharing a high number of properties were obtained, instead of the more heterogeneous clusters with different properties that were produced with two and five clusters. Only 2, 5, and 10 clusters were studied in this experiment because the clustering quality decreased when the number of clusters was set above 10. It can be concluded from these results that COMI achieves the best runtime compared to the existing ontology matching algorithms, particularly for large ontologies like the DBpedia database, and that this does not degrade the quality of the solution if an appropriate number of clusters is chosen. Table 3 compares the quality of the matching of the POMI framework and the baseline algorithms (i.e., EIFPS and RiMOM) on the OAEI ontology database. The POMI framework exceeded the other two algorithms in quality (recall, precision, and F-measure) when varying the percentages of the data properties and of the instances from 20% to 100%, in all the cases except the first one, which included 20% of the data properties and instances. This also shows that increasing the number of data properties and instances did not affect POMI's quality: the POMI quality reached 92%, while the EIFPS and RiMOM qualities were below 70% and 72%, respectively. These results were achieved thanks to the pattern mining techniques that obtain the most relevant data properties of the ontologies.

Performance on OAEI
In this experiment, the scalability of the COMI and POMI frameworks was evaluated. Several criteria, such as the quality of the solutions, the computational cost (i.e., runtime), and the memory usage, were evaluated on the OAEI ontology databases. The standard Java API was used to measure the memory usage of the compared algorithms. Table 4 presents the F-measure, CPU time, and memory usage of POMI, COMI, and an exhaustive strategy (which enumerates all possible matchings of the two ontologies) on various ontology databases. As shown, POMI achieved the best F-measure in 15 of the 18 cases. The quality of POMI was up to 92% in all the cases, while the quality of COMI and of the exhaustive strategy was less than 84% and 72%, respectively. These results were achieved with the knowledge discovered by POMI, which allows the dimensional space of the ontology databases to be reduced more effectively. The results also show that the memory usage and runtime of COMI and POMI converged to similar values, whereas the exhaustive approach achieved the worst results on both measures, which can be attributed to the fact that it enumerates all the combinations without any reduction of the search space. The other two strategies enhance the exploration of the solution space by using the clusters and the relevant discovered patterns.

Case study on smart-city semantic modeling
The last set of experiments aimed to show the ability of the COMI and POMI algorithms to deal with semantic modeling in smart-city environments. While plenty of proposals related to smart-city data have been made, semantic modeling from these data is an open research problem in the smart-city community. In this study, we address this challenging issue by applying the ontology matching process to the smart-city data described at http://www.noaa.org/. Table 5 shows the results of the three approaches (POMI, COMI, and RiMOM), where the percentages of the instances and the properties varied from 20% to 100%. The results reveal that COMI and POMI outperformed RiMOM in terms of runtime and solution quality. These results confirm again the usefulness of COMI and POMI for solving the ontology matching problem and their ability to deal with heterogeneous large-scale data.

TABLE 5 A comparison of F-measure and CPU time of pattern mining for ontology matching-based instances (POMI), clustering for ontology matching-based instances (COMI), and RiMOM on the smart-city data, varying both the percentage of instances (%I) and the percentage of data properties (%P) from 20% to 100%

From our extensive experiments dealing with smart-city data, some perspectives remain to be studied:
1. Outlier detection: Many outliers were found in the experiments. These outliers reduced the overall performance of the ontology matching process, so it would be beneficial to remove them in the preprocessing step. One solution is to apply existing outlier detection algorithms, such as the local outlier factor and k-nearest neighbors. A local reachability distance between properties and instances should be developed to adapt these algorithms to ontologies.
2. Crowdsourcing: Ontology matching solutions may identify different alignments from the same data; the problem is then to decide which alignments are useful for city planners. A crowdsourcing approach may be applied to improve the usefulness of the detected alignments, where different ontology matching approaches work together to identify the best alignments delivered to the city planners. Agents, represented by approaches and programs, could find the alignments locally and send them to the city planners, who could then use crowdsourcing environments to find the best alignment for smart-city semantic modeling.
3. Missing ground truth: Missing ground truth is a common problem in evaluating ontology matching algorithms, in particular in real scenarios such as smart-city semantic modeling. As challenges for future research regarding the quality assessment of ontology matching results, the following issues and research questions remain to be addressed:
• Defining useful, publicly available benchmark smart-city data for semantic modeling problems would be beneficial for analyzing ontology matching algorithms.
• It would be very useful to identify meaningful criteria for an internal evaluation of ontology matching. One way to address this challenging issue is to provide unified ranking-function scores to rank the alignments; these functions should be independent of the whole process of identifying the best alignments.

CONCLUSIONS
This paper presented two new frameworks, called COMI and POMI, which are clustering-based and pattern-mining-based approaches to solving the ontology matching problem. COMI uses clustering to solve the matching problem among ontologies and mainly consists of two steps. The first step groups the highly correlated instances of each ontology into similar clusters using the k-means approach; this is a preprocessing step performed only once. The extracted knowledge is then used to find the matching between the instances of the ontologies. POMI selects the most frequent data properties that describe the overall instances of an ontology and explores the different correlations between data properties. To evaluate the performance of COMI and POMI, several experiments were carried out on the DBpedia and OAEI ontology databases. The experimental results showed that COMI is much faster than the baseline EIFPS and RiMOM algorithms, and that POMI yields good quality compared to EIFPS and RiMOM. Furthermore, a case study on smart-city semantic modeling demonstrated the ability of COMI and POMI to deal with heterogeneous large-scale smart-city data. In future work, other data mining techniques, such as further pruning strategies69,70 and high-utility pattern mining,19,71 could be used to extract more relevant knowledge to help the ontology matching process. Using emergent HPC platforms, such as GPUs,72-74 to handle very large-scale ontology databases will also be considered as an extension of this work. In addition, using clustering in other semantic modeling tasks, such as the integration of existing databases and the building of shareable databases, is a further research topic.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in DBpedia at http://wiki.dbpedia.org/Datasets, OAEI at http://oaei.ontologymatching.org, and the Smart City Use case at http://www.noaa.org/.