Orchestration-based mechanism for sampling adaptation in sensing-based applications

Currently, the world is witnessing a boom in sensing-based applications, where the number of connected devices is becoming higher than the number of people. Such small sensing devices are now deployed in their billions around the world, collecting data about their surroundings and reporting them to data analysis centres. This allows a better understanding of the world and helps to reduce the effects of potential risks. However, while the benefits of such devices are real and significant, sensing-based applications face two major challenges: big data collection and the restricted power of the sensor battery. In order to overcome these challenges, data reduction and sensor sampling adaptation techniques have been proposed to reduce data collection and to save sensor energy. The authors propose an orchestration-based mechanism (OM) for adapting the sampling rate of the sensors in the network. OM is two-fold: first, it proposes a data transmission model at the sensor level, based on clustering and the Spearman coefficient, in order to reduce the amount of data transmitted to the sink; second, it proposes a sampling rate mechanism at the cluster-head level that searches for similarity between data collected by neighbouring sensors and then adapts their sensing frequencies accordingly. A set of simulations on real sensor data has been conducted to evaluate the efficiency of OM, in terms of data reduction and energy conservation, compared to other existing techniques.


| Problem statement
Indeed, sensing-based applications pose many challenges for both the community and researchers. On the one hand, sensor hardware is limited in resources, especially in battery supply, which cannot be replaced or recharged in hostile or harsh environments [3,4]. On the other hand, the dense deployment of sensor nodes, along with the need for continuous zone monitoring, leads to a massive amount of data collection in such networks [5,6]. Consequently, transmitting such big data will quickly deplete the available energy of the sensors. Therefore, researchers have focused on data reduction and sensing frequency adaptation techniques to improve the energy consumption of sensors and reduce the complexity of data analysis at the sink node [7-10].

| Our contribution
Here, an orchestration-based mechanism for energy conservation and minimising transmission in sensing-based applications is proposed. The objective of this mechanism is to adapt the sampling rate of each sensor according to the variation of the monitored condition, the remaining energy of the sensor and the similarity to data collected by neighbouring nodes. The contribution of this study is described as follows:
• At the sensor level, a new version of the K-means clustering algorithm called SK-means (i.e., Spearman-based K-means), which combines the Spearman coefficient with traditional K-means, is proposed. The new version aims to overcome the two main challenges of traditional K-means: the selection of the optimal number of clusters and the convergence function. Subsequently, SK-means reduces the periodic data transmission from each sensor to the sink, thus avoiding network overload and saving sensor energy.
• At the cluster-head (CH) node, the authors propose a model that first searches for the spatial-temporal correlations between nodes. Then, based on the node correlation degree, the model adapts the sensing frequency of each sensor according to its similarity with the data collected by its neighbours. This reduces the data collection size, eliminates the redundancy existing among nodes and enhances the power consumption of the sensor.
Through simulations on real sensor data, the effectiveness of the proposed mechanism has been validated in terms of minimising energy consumption and data transmission compared to other existing techniques.
The remaining article is organised as follows. Section 2 outlines different data reduction and energy-efficient techniques proposed in sensing-based applications. Section 3 depicts the periodic clustering architecture used in the network. Sections 4 and 5 present the data reduction model and the sampling rate model proposed at the sensor and CH levels, respectively. Simulation results are discussed in Section 6. Finally, the conclusion and future work are highlighted in Section 7.

| RELATED WORKS
In recent years, proposing energy-efficient techniques has been the main target of almost all researchers' works. Their primary focus is to reduce the data routed over the network, either at the sensor node or at intermediate nodes, mostly the CHs. Indeed, in most of the proposed techniques, the reduction process is performed based on data aggregation [11], clustering [12], compression [13] or sampling rate adaptation [7,8].
The authors in [9,10,14-17] dedicated their works to reducing the raw data transmission at the level of the sensors. In [14], the authors propose a priority-based compressed data aggregation (PCDA) technique to reduce the amount of health data transmitted. PCDA uses a compressed sensing approach, based on a sensing matrix and convex optimisation, followed by a cryptographic hash algorithm, which uses a key predistribution scheme, at the biosensor level to preserve information accuracy before sending data for diagnosis. The simulation shows that PCDA ensures a low execution time and communication overhead with moderate energy consumption. In [15], the authors propose a sequential lossless entropy compression (S-LEC) scheme, which organises the alphabet of integer residues obtained from a differential predictor into groups of increasing size. An S-LEC codeword consists of two parts: the entropy code specifying the group and the binary code representing the index in the group. The performance of S-LEC is evaluated on real-world datasets from SensorScope and volcanic monitoring, and the obtained results show reduced energy consumption for the dynamic volcano dataset compared to other existing techniques, particularly LEC and S-LZW. In [9], the authors propose three mechanisms that allow the sensor to adapt its sampling rate to the variation of the monitored environment. The proposed mechanisms are respectively based on similarity functions (Jaccard coefficient), distance functions (Euclidean distance) and analysis of variance with statistical tests (ANOVA and Bartlett test). The proposed techniques work in rounds, where each round consists of a set of time periods, and the sensor adapts its sampling frequency at the end of each round. Across different scenarios, the proposed techniques achieve minimum energy consumption with accurate data collection. Finally, the authors of [10] propose an adapted version of the dual prediction scheme (DPS) algorithm.
The new version uses a collection of models for data prediction over the past sequences of the DPS algorithm, without classically updating the history data table. Indeed, the new prediction model is computed at the sensors and sent to the sink, or vice versa. The performance of DPS is tested using data collected from the meteorological station located at Tlemcen (Algeria), and the results show that the data transmission ratio is reduced by more than 90% when accurate predictions are achieved.
The authors in [18-23] dedicated their works to reducing the amount of data circulating in the network along the path to the sink, i.e., at intermediate nodes. In [19], the authors propose a cluster-based data gathering algorithm for WSN called lifetime-enhancing cooperative data gathering and relaying (LCDGRA). LCDGRA works in three phases: the first phase groups the sensor nodes into clusters based on the K-means clustering and Huffman coding algorithms. The second phase assigns a set of relay nodes to each CH in order to aggregate data before sending them to the sink node. During the last phase, the aggregated data are coded based on random linear coding and then relayed to the base station. The simulations show that LCDGRA can ensure an efficient convergence of K-means (an average of 31 iterations), a reduced data latency (up to 18%) and less energy consumption (up to 37%) compared to other techniques. In [20], an online data tracking and estimation (ODTE) technique is proposed to track poor data collected at the sink. ODTE is mainly based on two systems: a data prediction system (DPS) and a distortion factor (DF). The DPS is used at the sensor in order to reduce its transmission using a defined limit, while the DF estimates optimal data collected at the sink node. Although ODTE can greatly reduce data transmission and conserve node batteries, it is very complex in terms of computation and processing speed. The authors of [18] propose a routing protocol called gateway clustering energy-efficient centroid (GCEEC) for WSN. The objective of GCEEC is to balance the load among the sensor nodes, as well as to select and rotate the CH near the energy centroid position of the cluster. The results show that GCEEC can greatly extend the network lifetime and reduce the network overload. However, these results are limited by the many assumptions taken during the tested scenario.
Finally, the authors of [21] propose a structure fidelity data collection (SFDC) technique dedicated to cluster-based periodic applications in WSNs. SFDC searches for both spatial and temporal correlations between nodes, using distance functions and similarity metrics, respectively. Then, it exploits these dependencies to reduce the number of nodes required for sampling and data transmission, and shows that such a reduction is bound to save energy.
The authors in [3-6,24,25] dedicated their works to minimising the data transmission at several levels of the network, i.e., the sensor and CH levels. In [24], the authors propose a data management framework for data collection and decision making in connected healthcare. The framework relies on three algorithms: first, an emergency detection algorithm sends critical records directly to the coordinator; second, an adaptive sampling rate algorithm based on ANOVA and the Fisher test allows each sensor to adapt its sampling frequency to the variation of the patient's situation; and third, a data fusion and decision-making model, based on a decision matrix and fuzzy set theory, is proposed at the coordinator. Although it has great advantages for patient monitoring and assessment, the proposed framework suffers from two main disadvantages: (1) in the case of a low-criticality patient, none of the data is archived at the hospital, so doctors cannot revise the patient archive to check the patient's progress; and (2) doctors cannot predict the progress of the patient's situation over subsequent periods of time. The authors of [5] propose a spatial-temporal model to extend the network lifetime based on three similarity metrics: Euclidean distance, cosine similarity and the Pearson product-moment coefficient (PPMC). They then propose a scheduling algorithm for switching correlated sensor nodes to sleep mode. Through real experiments, the authors show that PPMC gives the best results, in terms of conserving network energy, compared to the other similarity metrics. However, PPMC has several disadvantages: (1) it does not search for similarity at the sensor node level; (2) it does not take into account the residual energy of the sensors when switching them to sleep mode; and (3) it assumes that all the correlated sensors have the same degree of correlation.
Finally, the authors of [25] propose a two-level node mechanism dedicated to periodic sensor applications. First, the authors propose an on-node aggregation method to remove redundant data collected by the sensor. Then, an in-network data reduction technique called prefix frequency filtering (PFF) is introduced at the CH level. PFF allows CHs to find similarities between data collected by neighbouring nodes in the same cluster, using the Jaccard similarity function.
Although most of the proposed techniques achieve efficient energy savings, they fail to satisfy all aspects of sensing-based applications and lack maturity. In addition, they are very complex and require massive processing. In the proposed work, an energy-efficient data reduction mechanism that is less complicated and suitable for resource-limited sensor nodes is presented. Furthermore, the proposed mechanism takes several parameters into account when adapting the sampling rate of the sensor, in order to preserve the integrity of the collected information.

| CLUSTER-BASED ARCHITECTURE NETWORK
In sensing-based applications, the network architecture represents one of the most important challenges after deploying the sensor nodes. Indeed, some metrics (such as congestion, energy consumption, network overload, and data loss) are highly dependent on network architecture. Here, the proposed mechanism relies on two main concepts of the network: cluster-based architecture and periodic data acquisition.
On one hand, the cluster-based network has been considered an efficient architecture for sensing applications in terms of energy conservation, high network scalability and data transmission. Typically, a cluster-based architecture divides the sensors in the network into clusters and assigns a cluster-head (CH) to each cluster. The CH is responsible for managing the data collected by the sensor members of its cluster. Subsequently, the CH can perform any type of data processing (like aggregation, compression, scheduling, spatio-temporal correlation etc.) over the sensor data before sending them toward the sink node. Figure 1 illustrates a simple network based on the cluster architecture, in which the communication between the sensors and their CHs, or between the CHs and the sink, is performed according to a single-hop transmission. However, dividing a network into clusters is not an easy task and faces many challenges. Hence, many works in the literature address issues related to cluster networks, such as the selection of cluster heads [26-28], the optimisation of cluster size [29,30] and the communication between sensors/CHs and CHs/sink [31,32]. However, the concern of the authors is to study the variation of the data collected by the sensors and not the formation of the clusters themselves. Therefore, a geographical clustering scheme in which nearby sensors are already assigned to the same cluster is considered.

FIGURE 1 Cluster-based architecture network
On the other hand, sensor nodes are responsible for monitoring the target zone and transmitting the collected data toward the CHs, which, in turn, forward them to the sink. Unfortunately, data transmission is a high-cost operation in terms of energy consumption. Thus, considering its limited energy supply, the lifetime of the sensor decreases drastically if all the collected data are sent to the CH. Hence, a periodic data acquisition model has been introduced in sensing-based applications with the aim of reducing the amount of data collected and transmitted by the sensors.
Basically, in a periodic acquisition model, data are collected periodically, where each period p is partitioned into time slots. At each slot t, each sensor node N_i captures a new reading r_t and forms, at the end of p, a vector of F readings as follows:

R_i = [r_1, r_2, …, r_F]

After that, the sensor sends its vector of data, i.e., R_i, to its appropriate CH.
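For illustration, this period-based acquisition can be sketched in a few lines of Python (a minimal sketch; the sense() driver function is a hypothetical stand-in for the sensor's sampling routine, not from the paper):

```python
def collect_period(sense, F):
    """Collect one period p: one reading per time slot t, forming the
    vector R_i = [r_1, ..., r_F] that is sent to the CH at the end of p.
    sense(t) is a hypothetical driver returning the reading at slot t."""
    return [sense(t) for t in range(F)]

# Example: a fake sensor whose reading grows by one unit per slot
R_i = collect_period(lambda t: 20 + t, 4)
```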

| SENSOR DATA REDUCTION MODEL
As mentioned before, data transmission consumes much of the available sensor energy. Thus, in order to extend the sensor's lifetime, the amount of data transmitted from it should be reduced. However, the data collected by sensor devices are mostly redundant and contain useless information. Thus, one of the most effective solutions for reducing data transmission is to eliminate redundancy and filter out non-useful information before sending the data to the CH. This section proposes a data reduction model that allows each sensor to locally search for similarity between the data collected periodically, remove the existing redundancy and then send the result toward the CH. The proposed model is based on the K-means algorithm adapted to the Spearman coefficient metric, which finds similarities among data by grouping them into clusters.

| Recall of K-means clustering algorithm
Generally, clustering is a data exploratory task that aims to group data into clusters in a way that the similarity among data in the same cluster is high and that among clusters is low. Researchers have proposed many data clustering techniques for various types of data. One of the most popular algorithms in data clustering is K-means [33]; it is flexible, simple, already adapted to a vast number of applications and used with various kinds of data [34][35][36].
Typically, the K-means is an iterative algorithm in which the process starts by randomly selecting an initial centroid for each cluster. Then, each data point is assigned to the nearest centroid and the first round of cluster formation is performed. After that, the cluster centroids are updated and the process is repeated until the convergence of the criterion function (Algorithm 1).
Subsequently, one of the most common criterion functions used in K-means is the sum of squared errors.

Algorithm 1 K-means algorithm
Require: Set of readings: R_i, number of clusters: K. Ensure: Set of clusters: C.
1: for each cluster C_j, where j ∈ {1, …, K} do
2:   randomly choose a centroid c_j among R_i to belong to C_j
3: end for
4: repeat
5:   for each reading r_i ∈ R_i do
6:     Assign r_i to the cluster C_j* with the nearest c_j* (i.e., |r_i, c_j*| ≤ |r_i, c_j|; j ∈ {1, …, K})
7:   end for
8:   for each cluster C_j, where j ∈ {1, …, K} do
9:     Update the centroid c_j to be the centroid of all readings currently in C_j, so that c_j = (1/|C_j|) Σ_{i∈C_j} r_i
10:  end for
11: until all K clusters meet the criterion function convergence
12: return C
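As a concrete illustration of Algorithm 1, a minimal one-dimensional K-means can be sketched in Python (the function and variable names are illustrative, not from the paper):

```python
import random

def kmeans(readings, k, max_iter=100):
    """One-dimensional K-means following Algorithm 1."""
    # Lines 1-3: randomly choose an initial centroid for each cluster
    centroids = random.sample(readings, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):                       # line 4: repeat
        # Lines 5-7: assign each reading to the nearest centroid
        clusters = [[] for _ in range(k)]
        for r in readings:
            j = min(range(k), key=lambda c: abs(r - centroids[c]))
            clusters[j].append(r)
        # Lines 8-10: update each centroid to the mean of its cluster
        new_centroids = [sum(c) / len(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:              # line 11: convergence
            break
        centroids = new_centroids
    return clusters, centroids
```

For instance, kmeans([20.1, 20.2, 25.0, 25.1], 2) separates the two temperature groups regardless of the random initialisation.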

| Spearman coefficient metric
Spearman correlation is a non-parametric test used to measure the degree of association between two data sets. It is determined by ranking each point of the two data sets; in the case of ties, an average rank is used. Moreover, the Spearman correlation gives a value between +1 and −1: +1 indicates a perfect positive association of ranks, −1 indicates a perfect negative association of ranks, while 0 indicates no association between the ranks of the two data sets. Unlike other tests, especially the Pearson coefficient, the Spearman test makes no assumption about the distribution, or the linearity of the relationship, of the values in the two data sets.
Mathematically, the Spearman correlation, ρ, between two data sets R_i and R_j can be calculated according to the following equation:

ρ = 1 − (6 Σ_k d_k²) / (F (F² − 1))

where
• d_k is the difference between the ranks of corresponding values in the sets;
• F is the number of values in each set.
Therefore, R_i and R_j are considered correlated sets with similar values if and only if the Spearman correlation is greater than a threshold ρ_s, i.e.:

ρ(R_i, R_j) > ρ_s
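The rank-based computation above can be sketched in Python as follows (a minimal sketch; note that with many ties the d_k closed form is only an approximation, and production code would compute Pearson on the averaged ranks instead):

```python
def ranks(values):
    """Rank values from 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average 1-based rank of the tie run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(R_i, R_j):
    """Spearman correlation: rho = 1 - 6 * sum(d_k^2) / (F * (F^2 - 1))."""
    F = len(R_i)
    d = [a - b for a, b in zip(ranks(R_i), ranks(R_j))]
    return 1 - 6 * sum(dk * dk for dk in d) / (F * (F ** 2 - 1))
```

For example, two sets whose ranks agree exactly give ρ = +1, while fully reversed ranks give ρ = −1.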

| K-means adapted to spearman coefficient: SK-means
After collecting the data at each period, i.e., R_i, the sensor tries to minimise its size before sending it to the CH in order to save its energy. The authors propose using the K-means algorithm to group similar data in R_i into clusters; the data redundancy in each cluster is then eliminated before data transmission. However, the use of the traditional K-means faces two main challenges: the selection of the number of clusters (K) and the convergence criterion function. On one hand, selecting the number of clusters is a crucial decision, as it determines the data transmission ratio of the sensor and affects the accuracy of the information sent to the sink. On the other hand, the number of iterations generated by K-means is highly dependent on the choice of the convergence criterion function; an inappropriate criterion function can increase the computation cost of K-means and thus degrade the data latency metric. Therefore, in order to overcome these challenges, a new version of K-means, called SK-means, which adapts the Spearman correlation to the traditional K-means algorithm, is proposed.
The idea behind SK-means is that all readings collected by a sensor during a period are initially considered similar and are assigned to the same cluster, i.e., the initial cluster. Then, it recursively divides a cluster into smaller clusters whenever the readings inside it are not sufficiently similar. The criterion function used in SK-means to stop the cluster division and obtain the final clusters is the Spearman correlation. Algorithm 2 describes the process of SK-means, which is applied over the readings collected at each period, R_i. First, all the readings are considered similar and R_i is assigned to a temporary set of clusters, i.e., L (line 1). Then, the Spearman correlation between the readings is calculated by dividing them into two equal subsets using the function Partition (lines 2-4). If the correlation exceeds the Spearman threshold, the readings are considered similar; consequently, the average of the readings is computed (i.e., r̄) and added, with its weight (i.e., wgt(r̄)), to the final reading set that will be sent to the CH (lines 5-9). Otherwise, i.e., if the correlation does not exceed the Spearman threshold, the readings are considered dissimilar and the K-means algorithm is applied in order to divide them into two clusters (lines 10-14). The process is repeated on the new clusters until all readings within each cluster become similar. Therefore, each sensor will send a reduced set of readings C_i in the form {(r̄_1, wgt(r̄_1)), (r̄_2, wgt(r̄_2)), …, (r̄_k, wgt(r̄_k))} to the CH at the end of each period.

Algorithm 2 SK-means algorithm
Require: Set of readings: R_i, Spearman threshold: ρ_s. Ensure: Reduced set of readings: C_i.
1: L ← {R_i}; C_i ← ∅
2: while L ≠ ∅ do
3:   take a cluster R_j from L
4:   (R_j^l, R_j^r) ← Partition(R_j) // two equal subsets
5:   if ρ(R_j^l, R_j^r) > ρ_s then
6:     r̄ ← mean(R_j)
7:     wgt(r̄) ← |R_j| // the total number of elements in R_j
8:     C_i ← C_i ∪ {(r̄, wgt(r̄))}
9:   else
10:    (C_1, C_2) ← K-means(R_j, K = 2)
11:    L ← L ∪ {C_1, C_2}
12:  end if
13: end while
14: return C_i
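The recursion described above can be sketched compactly in Python (assumed details, for illustration only: a simple no-ties Spearman, a one-dimensional two-cluster K-means as the splitting step, and direct summarisation of clusters with fewer than four readings):

```python
def spearman(a, b):
    """Spearman rho via the closed form (assumes no ties, for brevity)."""
    rank = lambda v: [sorted(v).index(x) + 1 for x in v]
    F = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(rank(a), rank(b)))
    return 1 - 6 * d2 / (F * (F ** 2 - 1))

def two_means(readings):
    """Split readings into two clusters with 1-D K-means (K = 2)."""
    c = [min(readings), max(readings)]
    while True:
        left = [r for r in readings if abs(r - c[0]) <= abs(r - c[1])]
        right = [r for r in readings if abs(r - c[0]) > abs(r - c[1])]
        nc = [sum(left) / len(left), sum(right) / len(right)]
        if nc == c:
            return left, right
        c = nc

def sk_means(R, rho_s, out=None):
    """SK-means sketch: recursively split until each cluster's two
    halves are Spearman-correlated, then emit (mean, weight) pairs."""
    out = [] if out is None else out
    half = len(R) // 2
    if len(R) < 4 or spearman(R[:half], R[half:half * 2]) > rho_s:
        out.append((sum(R) / len(R), len(R)))   # (r̄, wgt(r̄))
        return out
    left, right = two_means(R)                  # K-means with K = 2
    sk_means(left, rho_s, out)
    sk_means(right, rho_s, out)
    return out
```

Each emitted pair is a cluster mean with the number of readings it replaces, which is exactly the reduced set C_i sent to the CH.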

| Analytical illustration of SK-means
This section shows the process of the SK-means algorithm using an analytical example (Figure 2). It assumes a set R_i consisting of 8 readings (i.e., F = 8) collected during a period. The first step is to divide R_i into two equal partitions R_i^l and R_i^r of size 4, using the Partition function. Then, the Spearman correlation is calculated between the partitions and it indicates that the partitions are not correlated (i.e., ρ ≤ ρ_s). Thus, the K-means algorithm is applied over the set R_i to divide the readings into two clusters C_1 and C_2. For each cluster, the process of dividing the readings into equal partitions and calculating the correlation between them is repeated, and K-means is applied each time, until a high correlation is detected between the readings and the final clusters are obtained. Finally, the mean values of all clusters are calculated and assigned their weights (i.e., the number of readings in each cluster). Therefore, the reduced set of readings (i.e., C_i) is sent towards the CH.

| SAMPLING RATE MODEL AT THE CH LEVEL

Mostly, the data collected by the sensors are spatially-temporally correlated. On one hand, the spatial node correlation results from the dense deployment of the sensors along with the random scattering strategy. On the other hand, the temporal node correlation depends on the variation of the monitored condition, which can speed up or slow down, and which pushes the neighbouring nodes to collect redundant data. Thus, after receiving the datasets coming from the sensors, OM proposes a sampling rate model that allows the CH to search for the spatial-temporal correlations among the sensors in order to adapt their sampling rates for the next period. The objective of the model is to reduce the amount of data collected at the sensors, thus enhancing their energies, and to minimise the data correlation among neighbouring nodes before sending the data to the sink.
Furthermore, the proposed CH sampling model takes into account two features to calculate the new sensing frequency of each sensor: the spatial-temporal node correlation and the remaining energy of each sensor.

| Spatial-based node correlation
In large-zone sensing applications, a massive number of nodes must be deployed to ensure full zone coverage and maintain a high reliability level for the collected data. In addition, in some harsh and hostile zones, the nodes are scattered over the target zone in a random manner. This leads to a specific spatial degree among the deployed nodes: the closer the distance between two nodes, the higher their spatial correlation degree, and vice versa. Then, two sensor nodes are considered spatially correlated if the geographical distance between them is less than a defined threshold. Let us first define that each node N_i is represented by the following 4-tuple: N_i = {x_i, y_i, S_r, T_r}, where x_i and y_i indicate the position of N_i, S_r is its sensing range and T_r is its transmission range (Figure 3). Thus, two nodes N_i and N_j are considered spatially correlated if their sensing ranges overlap sufficiently, i.e., if the distance between them does not exceed a threshold α:

FIGURE 3 Spatial correlation between nodes

E_d(N_i, N_j) = √((x_i − x_j)² + (y_i − y_j)²) ≤ α

where E_d is the geographical Euclidean distance between N_i and N_j, and α is the threshold for the sensing range intersection between two nodes, which may lie in [0, 2·S_r]; a value of 0 indicates that both sensors monitor the same zone area, while 2·S_r indicates that there is no spatial correlation between the nodes.
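The spatial test therefore reduces to a distance comparison; a minimal sketch follows (the dictionary-based node representation is an assumption for illustration):

```python
import math

def euclidean(ni, nj):
    """Geographical Euclidean distance E_d between two node positions."""
    return math.hypot(ni["x"] - nj["x"], ni["y"] - nj["y"])

def spatially_correlated(ni, nj, alpha):
    """Nodes are spatially correlated if E_d does not exceed alpha,
    where alpha lies in [0, 2 * S_r] (overlap of the sensing ranges)."""
    return euclidean(ni, nj) <= alpha
```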

| Temporal-based node correlation
The temporal correlation among nodes aims to find the similarities between their collected data, whether or not they are spatially correlated. However, a closer geographical distance between nodes, or a low variation of the monitored zone, can increase the similarity between the nodes' data. Subsequently, one can find several functions that search for similarity among data sets, such as Jaccard, Dice and Cosine. Here, the authors focus on the Jaccard similarity, as one of the most widely used functions, well adapted to several domains. Then, in order to calculate the temporal correlation between two data sets C_i and C_j sent from two nodes N_i and N_j during a period, a score table (ST) is first defined.
The ST is a customisable guide defined by the end-user or an expert, which aims to determine the criticality of the readings collected about a condition. Thus, the ST allows early detection of a critical situation and alerts the end-user as fast as possible. Typically, the ST defines a normal range of readings, e.g., [r_i, r_j], indicating that the monitored condition is in a normal situation. Readings in the normal range are assigned a score of 0. Then, the further the readings are from the normal range, the higher their criticality (or score). Table 1 shows the ST that determines the criticality of the captured readings. First, the ST determines the normal range of readings and then defines a threshold ɛ in order to calculate the deviation of a reading from the normal range; readings can be above or below the normal range. ɛ is a user-defined threshold determined according to the application requirements. The score of a reading can take a value between 0 and 3, where 3 indicates a highly critical reading with respect to the normal range.
TABLE 1 Score table (scores 0 to 3 assigned to reading ranges defined relative to the normal range and the deviation threshold ɛ)

Based on the ST, the CH calculates the temporal correlation, according to the Jaccard similarity, between two data sets C_i and C_j collected by nodes N_i and N_j, respectively, according to the following steps:
• For each reading set, the CH calculates its score set. For instance, the score set O_i of C_i is as follows:

O_i = {(o_1, wgt(o_1)), (o_2, wgt(o_2)), …, (o_k, wgt(o_k))}

where o_t is the score of the mean reading r̄_t, wgt(o_t) = wgt(r̄_t) and o_t ∈ [0, 3].
• The Jaccard similarity is calculated based on the reading scores of both sets:

J(O_i, O_j) = Σ_{t=0}^{3} min(|wgt(o_t^i)|, |wgt(o_t^j)|) / Σ_{t=0}^{3} max(|wgt(o_t^i)|, |wgt(o_t^j)|)

where |wgt(o_t^i)| is the total weight of the readings having score t in O_i.
• N_i and N_j are temporally correlated if the Jaccard similarity between their score sets is greater than a threshold t_J:

J(O_i, O_j) > t_J

where t_J takes a value in [0, 1].
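The CH-side computation can be sketched as follows; the step-of-ɛ score ranges and the weighted-Jaccard form (ratio of per-score minimum to maximum total weights) are assumed instances for illustration, since the paper's exact table and equation are not reproduced here:

```python
def score(r, r_min, r_max, eps):
    """Score a mean reading against the normal range [r_min, r_max],
    using bands of width eps (assumed reconstruction of Table 1)."""
    if r_min <= r <= r_max:
        return 0
    dev = (r_min - r) if r < r_min else (r - r_max)
    if dev <= eps:
        return 1
    if dev <= 2 * eps:
        return 2
    return 3

def score_weights(C, r_min, r_max, eps):
    """Total weight of readings per score t in a reduced set C = [(mean, wgt)]."""
    w = {0: 0, 1: 0, 2: 0, 3: 0}
    for mean, wgt in C:
        w[score(mean, r_min, r_max, eps)] += wgt
    return w

def jaccard(Ci, Cj, r_min, r_max, eps):
    """Weighted Jaccard similarity between the score sets of two nodes."""
    wi = score_weights(Ci, r_min, r_max, eps)
    wj = score_weights(Cj, r_min, r_max, eps)
    inter = sum(min(wi[t], wj[t]) for t in range(4))
    union = sum(max(wi[t], wj[t]) for t in range(4))
    return inter / union if union else 1.0
```

Two nodes whose reduced sets fall into the same score bands with the same total weights yield J = 1, i.e., a perfect temporal correlation.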

| Degree of node correlation
In order to adapt the sampling rate of a sensor, the CH searches for its correlation degree with other nodes. The correlation degree, denoted D_i, of a node N_i represents the set of neighbouring nodes that are spatially-temporally correlated to N_i. Subsequently, based on Equations 3 and 4, two nodes N_i and N_j are considered spatial-temporally correlated if they are geographically close and their generated data sets C_i and C_j are similar, i.e.:

E_d(N_i, N_j) ≤ α and J(O_i, O_j) > t_J

Therefore, the correlation degree of the node N_i can be defined as:

D_i = {N_j | N_j is spatial-temporally correlated with N_i}

It is also assumed that |D_i| is the number of nodes in D_i.

| Sampling rate algorithm
In addition to the correlation degree of each node, the CH takes into account the remaining energy of the node in order to adapt its sampling rate. The intuition is that if the node has a high correlation degree and its battery level is at a low level compared to its correlated nodes, then its sampling rate should be decreased, and vice versa. Consequently, the energy of the sensor will be conserved and the redundancy among the data collected by neighbouring nodes will be minimised.
Algorithm 3 shows the sampling rate model applied at the CH after receiving the data sets from the sensors at each period. The algorithm takes as input the initial sampling rate of a sensor (i.e., the period size, F), its initial energy (i.e., E_i) and its correlation degree. Then, the algorithm calculates, as output, the new sampling rate of the sensor, i.e., S_t, for the next period. Furthermore, OM defines a threshold known as the minimum sampling rate, i.e., S_min, that takes into account the criticality of the monitored application; S_min takes a value between 0% and 100% of the period size, where a value of S_min close to 0% indicates a less critical application while a value close to 100% indicates a more critical application. Moreover, an energy sampling threshold β% has been defined, indicating the percentage of the sampling rate that the sensor must add to or remove from its current sampling rate depending on the energy level of its correlated nodes; β% takes a value in [0%, 100%]. The process of Algorithm 3 starts by initialising the sampling rate of the sensor to its maximum, i.e., the period size (line 1). After that, if the node has a spatial-temporal correlation with other nodes, then its sampling rate is reduced according to its correlation degree (lines 2-3). In addition, for each correlated node N_j ∈ D_i, if the remaining energy of N_i is less (respectively greater) than that of N_j, then the sampling rate of N_i is further reduced (respectively increased) by β% (lines 4-11). Finally, if the new sampling rate of N_i is less than the minimum sampling rate determined for the application, then the new sampling rate of the sensor is set to S_min (lines 12-14).

Algorithm 3 Sampling rate algorithm
Require: A node: N_i, A period size: F, Initial energy: E_i, Set of correlated nodes: D_i, Minimum sampling rate: S_min, Energy sampling threshold: β. Ensure: New sensor sampling rate: S_t.
1: S_t ← F
2: if |D_i| > 0 then
3:   reduce S_t according to the correlation degree of N_i
4:   for each node N_j ∈ D_i do
5:     if E_i < E_j then
6:       S_t ← S_t − β × S_t
7:     else if E_i > E_j then
8:       S_t ← S_t + β × S_t
9:     end if
10:  end for
11: end if
12: if S_t < S_min then
13:   S_t ← S_min
14: end if
15: return S_t
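A Python sketch of this adaptation follows; note two assumptions not fixed by the paper: the division by (|D_i| + 1) is one possible instance of "reduce according to the correlation degree", and β is expressed as a fraction rather than a percentage:

```python
def new_sampling_rate(F, E_i, neighbour_energies, S_min, beta):
    """Adapt the sensing frequency of node N_i for the next period.

    F: period size (maximum sampling rate); E_i: remaining energy of N_i;
    neighbour_energies: remaining energies E_j of the nodes in D_i;
    S_min: minimum sampling rate; beta: energy sampling threshold in [0, 1].
    """
    S_t = F                                    # line 1: start at the maximum
    if neighbour_energies:                     # lines 2-3: correlated node,
        S_t = F / (len(neighbour_energies) + 1)  # reduce by correlation degree
    for E_j in neighbour_energies:             # lines 4-11: energy balancing
        if E_i < E_j:
            S_t -= beta * S_t                  # weaker node samples less
        elif E_i > E_j:
            S_t += beta * S_t                  # stronger node samples more
    return max(S_t, S_min)                     # lines 12-14: keep criticality floor
```

An uncorrelated node keeps the full period size, while a highly correlated, energy-poor node is driven down to (but never below) S_min.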

| SIMULATION RESULTS
In order to evaluate the performance of OM, multiple series of simulations were performed using real sensor data collected from the Intel Berkeley Research Laboratory [37]. This dataset contains readings from 46 Mica2Dot sensors with weather boards that collect humidity, temperature, light and voltage values. For the sake of simplicity, the simulation focuses only on the temperature field. Every 31 s, each Mica2Dot sensor collects a new reading for each feature and sends it towards the sink for archiving purposes. In the simulation, a file that includes a log of about 50,000 readings for each sensor is used. It is assumed that each sensor reads the data from its corresponding file for a period of time and then sends them toward a CH placed at the centre of the lab after applying the proposed mechanism. Figure 4 shows the geographical distribution of the Mica2Dot sensors in the Intel lab, where each sensor takes an Id from 1 to 54 (the yellow sign indicates some failed sensors). The algorithms used in OM are implemented in a Java-based simulator and the obtained results are compared to those obtained with PFF [25] and S-LEC [15]. Table 2 summarises the parameters used in the simulation with their tested values.
Furthermore, the ɛ threshold used in the score is set to 1, and the customizable score table adapted to the temperature readings is shown in Table 3.

6.1 | Reading score study

Figure 5 shows the reading values collected by five randomly selected sensors (Figure 5a) together with their scores (Figure 5b), calculated according to Table 3. The obtained results lead to the following observations:

1. The temperature condition in the Intel lab changes very slowly, owing to the high redundancy among the data collected by the sensors.
2. The spatial-temporal correlation among neighbouring nodes is high, for instance between sensors 1 and 2 or sensors 3 and 4.
3. Spatial correlation between sensors does not always imply temporal correlation among the collected data. For instance, sensors 1 and 2 are spatially correlated and generate similar data until period 500, after which their collected data become dissimilar.
4. Sensors that are not spatially correlated can nevertheless exhibit a temporal correlation. For instance, sensors 2 and 5 are not geographically close but, starting from period 700, they collect redundant data.
5. The correlation among nodes is also reflected in the scores calculated for the data collected by correlated nodes: the reading scores of sensors 1 and 2 mostly vary between 0 and 2, while those of sensors 3 and 4 lie between 2 and 3.

Thus, the criticality of the temperature condition in the Intel lab can change from one place (e.g. sensor) to another.

| Data transmission ratio at sensor

Figure 6 studies the periodic data transmission at the sensor level with respect to the simulation parameters (Figure 6a-f). It is observed that OM reduces the data transmission to the CH more than PFF and S-LEC in all cases; it allows each sensor to send up to 65% less data than PFF and up to 78% less data than S-LEC. Furthermore, the obtained results show that:
• By decreasing the period size, OM allows each sensor to reduce its data transmission to the sink (Figure 6a). This is because, when the period size increases, the redundancy among the collected data decreases; thus, the sensor must increase its data transmission in order to preserve the information integrity.
• By increasing the spatial correlation threshold or decreasing the Jaccard similarity threshold, the periodic data transmission of each sensor decreases (Figure 6c,d). This is due to the spatial-temporal correlation, which increases when α increases or t_J decreases; thus, the CH reduces the sampling rate of neighbouring nodes in order to reduce the redundancy among their collected data.
• By decreasing the minimum sampling rate, OM gives better results in terms of data reduction at the sensor (Figure 6e). This is because the criticality of the monitored application increases with S_min; thus, the CH must increase the sensor sampling rate in order to increase the reliability of decision making.
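The redundancy test underlying the transmission model can be sketched as follows. This is a hedged illustration: the text states that OM's sensor-level model relies on the Spearman coefficient, but the exact decision rule shown here (compare the current period against the last transmitted one and skip transmission above the threshold ρ_s) is an assumption.

```python
# Hypothetical sketch of a Spearman-based redundancy check at the sensor:
# transmit a new period only when it is not rank-correlated with the last one.

def rank(values):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend over a run of tied values
        avg = (i + j) / 2 + 1           # average of the tied positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman coefficient as the Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den if den else 0.0

def should_transmit(current, previous, rho_s=0.9):
    """Transmit only when the new period is not redundant with the last one."""
    return spearman(current, previous) < rho_s
```

This rule is consistent with the results above: a larger ρ_s makes the redundancy test harder to pass, so more periods are transmitted and less data is lost.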

| Sensor sampling rate study
This section shows the performance of the sampling rate model proposed at the CH level in OM, in terms of adapting the sensing frequency of a sensor based on its spatial-temporal correlation with neighbouring nodes. The performance is studied by varying the energy sampling threshold (Figure 7a) and the minimum sampling frequency (Figure 7b) over a set of 15 periods. The obtained results show that the sampling rate of the sensor is dynamically adapted after each period depending on the spatio-temporal correlation between nodes. Although the spatial correlation is fixed by the node deployment, the temporal correlation can differ from one period to another. Furthermore, the following observations are evident:
• By varying the energy sampling threshold (Figure 7a), it is shown that the sampling rate of the sensor is reduced more as the value of β increases. Hence, with β = 10%, the sensor reduces its sampling rate to the minimum sampling threshold more often than with β = 5%. This confirms the behaviour of OM, which reduces the sampling rate in order to further save energy.
• By varying the minimum sampling threshold (Figure 7b), it is shown that the sensor reduces its sampling rate more when the value of S_min decreases. For instance, the sampling rate of the sensor varies mostly between 20 and 70 when S_min = 20%, between 30 and 80 when S_min = 30%, and between 40 and 90 when S_min = 40%. This also confirms the behaviour of OM, which increases (respectively, decreases) the sampling rate when the criticality of the application is high (respectively, low).

Figure 8 shows the percentage of data loss after applying OM and the PFF technique, with respect to several parameters. Data loss is an essential metric in sensing-based applications, as it can affect the decision made at the end user. In the simulation, a reading is considered lost if a sensor collects it but neither the sensor nor its correlated neighbours send it to the sink. The percentage of data loss is thus calculated by dividing the number of readings lost by all the sensors by the total number of raw readings. The obtained results show that OM outperforms PFF in terms of maintaining data accuracy in all cases: the percentage of data loss using OM does not reach 4% in the worst case, while it exceeds 6% using PFF. This is because PFF reduces the data transmission from sensors based on the temporal correlation only, while OM uses both spatial and temporal correlation, which increases the accuracy of the collected data. Furthermore, it can be observed that the percentage of data loss usually decreases as the amount of transmitted data increases (see Figure 6). Therefore, the data accuracy of OM increases with increasing values of ρ_s and t_J, or with decreasing values of F and α.

Figure 9 shows the number of clusters obtained after applying the SK-means algorithm over the periodic data collected by a sensor, for a set of 10 periods, along with the number of readings assigned to each cluster.
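The data-loss metric defined above reduces to a simple ratio, sketched here for clarity. The per-reading bookkeeping (how delivered readings are matched to collected ones across correlated neighbours) is an assumption; only the final formula follows the text.

```python
# Minimal sketch of the data-loss metric: a collected reading counts as lost
# when neither the sensor itself nor a correlated neighbour delivers it.

def data_loss_percentage(collected, delivered):
    """
    collected : total number of raw readings collected by all sensors
    delivered : readings that reached the sink, directly or via a
                correlated neighbour
    """
    lost = collected - delivered
    return 100.0 * lost / collected
```

For instance, if 48,500 of 50,000 raw readings are recoverable at the sink, the loss is 3%, within the sub-4% worst case reported for OM.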