An efficient data collection framework in the sky: An affine transformation approach based on Internet of unmanned aerial vehicles

Due to a large amount of time-series data (e.g. precipitation, temperature, humidity) in the air, reliable and fast data collection is a challenging issue. Using an unmanned aerial vehicle (UAV) as a data collector to collect time-series data is an effective method. However, due to the


INTRODUCTION
As low-cost sensors are widely deployed in smart cities, large amounts of time-series data (TSD) (e.g. temperature, humidity, and air quality index) are collected for climate management, weather forecasting, and air monitoring [1]. For example, high-precision weather forecasting can be achieved by deploying sensors (e.g. on balloons) to collect rainfall data. However, this usually requires deploying a large number of wireless sensor devices, and the resulting amount of TSD is huge [2]. It is estimated that by 2020, the total number of sensor devices installed globally will reach 30 billion [3,4]. By 2025, this number will more than double. The huge amount of data and the large number of sensor devices make data collection very difficult [5].
UAVs can plan their trajectories based on the different delay requirements of sensor devices [11,12]. Therefore, it is necessary to optimize the trajectories of UAVs to reduce the data collection time. Existing work has focused on using reinforcement learning tools to optimize UAV trajectories and power scheduling schemes. Zhang et al. in [13] used reinforcement learning tools to design a sense-and-send protocol to optimize UAV trajectories and power scheduling schemes. Liu et al. in [6] used a matrix completion approach to optimize the trajectory of UAVs for efficient data collection. Yang et al. in [14] designed an energy-efficient scheme that performs energy trade-offs in ground-to-UAV communications via the UAV trajectory.
Although trajectory optimization can reduce the time for UAVs to collect data, the limited storage capacity of UAVs means that they may need to make multiple trips to complete a data collection task, which increases the collection time.
In this context, it is difficult for UAVs to continuously collect time-series data. The reasons are as follows: (1) the time delay for the continuous acquisition of time-series data by sensor equipment is very large [15]; (2) UAVs need to make multiple trips to collect data, which prevents them from collecting time-series data continuously; (3) the storage capacity of UAVs is limited, which makes it impossible to collect large amounts of time-series data. At present, this is a new challenge for efficient data collection using a UAV-based data collection framework. Horsman et al. in [16] proposed a data transformation scheme based on singular value decomposition to reduce the space required for data storage.
To meet the time requirements of sensor devices that sense TSD, we propose an efficient data collection framework based on an affine transformation approach using the Internet of UAVs. In this framework, a UAV reduces the storage space and time of data collection by storing a small-scale dataset instead of a large-scale dataset, where the small-scale dataset (i.e. the dominant dataset) is much smaller than the collected dataset and can be used to restore the original large-scale dataset. Within a certain error range, the original dataset can be restored from the dominant dataset by affine transformation. Specifically, UAVs can achieve efficient data collection and low-cost storage of TSD by storing dominant datasets. The main contributions of the proposed framework are as follows:
• We utilize the affine transformation approach to propose an efficient UAV-enabled data collection framework to collect time-series data. In this framework, UAVs can achieve efficient data collection and low-cost storage of time-series data by storing dominant datasets.
• We define the concept of the dominant dataset, formalize the problem of selecting the dominant dataset, and prove that the storage required for the data can be reduced by storing the dominant dataset when UAVs are used to collect data.
• We use the affine transformation model as a reduction function to solve the linear compression problem in the UAV-based data collection framework. Furthermore, we propose a method for dynamically extracting the dominant dataset.
• We conduct extensive experiments on real-world datasets to demonstrate the performance of the proposed framework. Experimental results show that the proposed framework can efficiently collect data and reduce the required storage and the collection time by half.
The rest of the paper is organized as follows. Section 2 summarizes the related work. Some background knowledge is summarized in Section 3. The problem description and system model are defined in Section 4. In Section 5, we describe in detail the proposed efficient data collection framework based on affine transformation using the Internet of UAVs. The theoretical analysis and simulation results are presented in Section 6. Section 7 concludes the paper.

Data collection in the sky
A large number of sensors are deployed in the air to collect surrounding sensing data, especially time-series data [8, 9, 17-19].
In recent years, UAVs have been widely used for data collection in wireless sensor networks [5-7, 20, 21]. For example, Jawhar et al. in [5] proposed a data collection scheme for linear wireless sensor networks using UAVs. Zhan et al. in [10] proposed an energy-efficient data collection scheme for UAV-enabled wireless sensor networks. Pan et al. in [15] used dynamic speed-controlled UAVs to collect data for the Internet of Things (IoT). In [22, 23], Pěnička et al. used UAVs with limited budget and curvature for data collection with non-zero sensing distance. However, the above work did not optimize the path of UAVs to collect data, which resulted in a large amount of energy consumption [13].
To address this issue, Liu et al. in [6] used a matrix completion approach to optimize the trajectory of UAVs for efficient data collection. Zhang et al. in [24-26] designed novel joint trajectory and power optimization schemes to reduce the energy that UAVs spend on collecting data. Yang et al. in [14] designed an energy-efficient scheme that performs energy trade-offs in ground-to-UAV communications via the UAV trajectory. By optimizing the trajectory of UAVs and the wake-up scheme of sensor nodes, the maximum energy consumption of the network can be reduced. Data collection using UAVs allows each node to communicate directly with UAVs, thereby reducing the energy that nodes consume on forwarding [27-29].
The time cost of data collection is an important factor to consider when using UAVs to collect data [30]. Many researchers focus on how to reduce the data collection time of UAVs. Zeng et al. in [31] minimized the task completion time of the UAVs by designing the optimal UAV trajectory. The authors assumed that UAVs have a time limit for collecting data from each node. Specifically, the authors placed virtual base stations (VBSs), where the radius of a VBS is the radius within which UAVs collect data. Zhan et al. in [32] studied data collection from a set of sensor nodes (SNs) in wireless sensor networks (WSNs) enabled by multiple UAVs. The authors aimed to minimize the maximum task completion time among all UAVs by jointly optimizing the UAVs' trajectories and the SNs' wake-up scheduling and association, while ensuring that each SN can successfully upload its data under a given energy budget. Mozaffari et al. in [33] proposed a novel framework for jointly optimizing the time and mobility of UAV data collection. Specifically, the authors optimized the UAV data collection scheme by investigating the locations of different UAVs and the associated data collection times. On the other hand, it is also possible to optimize where the UAV collects data. For example, Al-Hourani et al. in [34] designed a mathematical model for obtaining the optimal elevation angle between UAVs and the ground in a specified environment. UAVs can therefore hover at the optimal height to cover the ground to the greatest extent. Because the coverage of each UAV is large, the number of locations where UAVs need to stay is small. This method thus optimizes the coverage of UAVs.
The cost of data storage is also an important factor to consider when collecting data with UAVs [35,36]. Because of its limited storage capacity, a UAV cannot store a large amount of data, which limits the wide use of UAVs for data collection [37]. Horsman et al. in [16] proposed a data transformation scheme based on singular value decomposition to reduce the space required for data storage. However, the time cost of such solutions is relatively large. Therefore, we need an efficient data conversion and storage solution.
All of the above are schemes that use UAVs for data collection. However, none of these solutions consider the size of the data collected and the storage capacity of the UAVs. It is very difficult for UAVs to collect all the data in one trip. Therefore, in this case, enabling UAV data collection is a challenging issue.

Affine transformation
Affine transformation is an important type of linear geometric transformation, which is widely used in image compression, image conversion, target recognition, etc. [38]. In recent years, affine transformation has been used in data management and data compression [3,39,40]. For example, Sath et al. in [39] utilized the affine relationship of time-series data to integrate similar conversion features into data processing, thereby achieving high-precision statistical calculation on time-series data. Wu et al. in [1] used affine transformation to achieve linear compression storage of electricity consumption data. We use the affine relationship between the data collected by UAVs to achieve compressed data storage, thereby reducing the time and cost of UAV data collection.

RELATED DEFINITIONS
In this section, we first give some definitions related to the dominant dataset. Second, we briefly introduce the concept and formal definition of the affine relation model. Third, we define the dominant dataset selection problem. Table 1 lists key notations in this paper.

TABLE 1 Key notations (entries include the probability that the reconstructed error is less than the threshold ε, the correlation distance measurement function, and the information extraction error function)

Time-varying objects in a timeline form a data sequence called time-series data (TSD), which we refer to as time-series sample objects. We use the matrix $X_{m \times n}$ to represent $m \times n$ TSD objects, in which there are n sample objects, each composed of m observation times. Specifically, in the TSD matrix $X_{m \times n}$, n represents the number of samples (e.g. the number of sensor devices that collect data), and m represents the length of the time series. Therefore, a sample object $x_{m \times n}$ is a subset of the TSD, that is, $x_{m \times n} \subseteq X_{m \times n}$, which represents the time-series data collected by one sensor. The formal expressions of $X_{m \times n}$ and $x_{m \times n}$ are as follows:
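As a sketch of this notation, assuming each sample object is stored as one column of length m (an assumption, since the layout is not fixed by the text above) and writing $x_i$ for the i-th sample object, the TSD matrix and one sample object could be written as:

```latex
X_{m \times n} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{pmatrix},
\qquad
x_i = (x_{1i}, x_{2i}, \ldots, x_{mi})^{\mathsf{T}}, \quad i = 1, \ldots, n.
```

Under this reading, row t holds the observations of all n sensors at time t, and column i holds the full time series of sensor i.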

Dominant dataset
Definition 1. Dominant Dataset: If and only if $f(P) \doteq Y_P$ and $X = Y_P \cup P$, P is called the dominant dataset of X and f is a reconstruction function. Specifically, we can use such a reconstruction function f to reconstruct a small dataset into a large dataset.

Definition 2. Information Extraction Error: Let S′ be the dataset reconstructed from the dominant dataset P by the reconstruction function f, and let S be the original dataset. The information extraction error between S and S′ is defined in terms of the difference |S − S′|.

Definition 3. (ε, δ)-solver: Given the parameters ε and δ, we assume that the small dataset P can be selected as the dominant dataset of X through the reconstruction function f, so that P can be used to restore X under the condition that the probability of the information extraction error being greater than ε is not greater than δ. We define this solution condition of the dominant dataset selection problem as the (ε, δ)-solver. In this context, we use the (ε, δ)-solver to control the accuracy of the reconstructed dataset: the probability that the error between a reconstructed data point and the real data point is greater than ε must be smaller than δ. If δ = 0, the information extraction error is always less than ε. In this case, the solution condition is called the ε-solver.
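The solver condition can be illustrated with a minimal sketch. It assumes that the error is measured per data point as a relative error, and it calls the error threshold and the probability bound `eps` and `delta`; the function name and the relative-error choice are assumptions made for illustration, not part of the definitions above.

```python
import numpy as np

def satisfies_solver(original, reconstructed, eps, delta):
    """Return True if the share of points whose relative error exceeds eps is <= delta."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    # Relative error per data point (assumed metric); the small constant
    # avoids division by zero for readings equal to 0.
    rel_err = np.abs(original - reconstructed) / (np.abs(original) + 1e-12)
    # Empirical estimate of Pr(error > eps).
    violation_rate = np.mean(rel_err > eps)
    return bool(violation_rate <= delta)
```

With `delta = 0` the check only passes when every reconstructed point is within `eps`, which corresponds to the stricter special case of the definition.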

Definition 4. Correlation Distance:
We assume that X and Y are a pair of TSD, and that the correlation between them can be determined by the distance between them. To measure the degree of correlation between X and Y, the correlation distance between X and Y is defined as D(X, Y). If it is possible to dominate between X and Y, then the condi-

Definition 5. Central Object and Target Object:
According to Definition (1), the objects in the dataset P are the central objects, and the objects in the dataset $Y_P$ are the target objects.

Affine relation model
Let the coordinates of point o be $\vec{x}$, and let $\vec{x}$ be linearly transformed to obtain $A\vec{x}$. Because a linear transformation leaves the origin fixed, a translation by b is added, so that the transformed point is expressed as $A\vec{x} + b$. Therefore, an affine transformation has the form $\vec{x} \mapsto A\vec{x} + b$.
According to Definition (5), we assume that S′ and P represent the sample object matrix and the centre object matrix of TSD, respectively. Then the affine relation model is defined as $X = P \times A + B$, where A is the coefficient matrix and B represents the residual matrix. The model takes the same form in the n-dimensional case.
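To make the affine relation model X = P × A + B concrete, here is a minimal sketch that estimates the coefficients by ordinary least squares with NumPy. Treating the offset as a separate vector `b` and returning what is left over as a residual matrix are assumptions made for illustration, as is the function name `fit_affine`.

```python
import numpy as np

def fit_affine(P, X):
    """Least-squares fit of X ≈ P @ A + b; returns (A, b, residual)."""
    P = np.asarray(P, dtype=float)   # centre objects, shape (m, k)
    X = np.asarray(X, dtype=float)   # sample objects, shape (m, n)
    # Append a column of ones so the offset b is estimated together with A.
    P1 = np.hstack([P, np.ones((P.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(P1, X, rcond=None)
    A, b = coef[:-1], coef[-1]
    # What the affine map cannot explain (assumed role of the residual matrix).
    residual = X - (P @ A + b)
    return A, b, residual
```

If the residual is small, storing `P`, `A` and `b` is enough to rebuild `X` approximately as `P @ A + b`, which is the storage saving the model is used for.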

Dominant dataset selection
In this work, the large-scale TSD collection problem based on UAVs is converted into a small-scale data collection problem that must meet the extraction accuracy requirements of the dominant dataset in the original time-series dataset. According to Definition (2, 6), we can use the (ε, δ)-solver to convert a large-scale dataset into a small-scale dataset (i.e. the dominant dataset), and we can use the reconstruction function f to restore the large-scale dataset. Therefore, UAVs can perform efficient data collection by storing the dominant dataset.

SYSTEM MODEL
Consider a data collection system based on the mobile Internet of UAVs that includes UAVs with limited storage capacity and sensor devices, where the UAVs are used as data collectors. Sensor devices are randomly deployed in the air, and each sensor device can sense the surrounding time-series data (e.g. temperature, humidity, air quality index, etc.). A UAV collects data by reading the data sensed by sensor devices [4]. Once the UAV has collected the data stored in a sensor device, that data is immediately deleted from the sensor device [2]. However, limited by its storage capacity, the UAV cannot store large amounts of time-series data and cannot collect all time-series data in one trip. Therefore, the UAV needs an efficient data compression storage system to exchange a large amount of data with the DC.

Our proposed architecture
As shown in Figure 1, the proposed data compression storage system includes a time-series data collection module, an affine transformation module, and a time-series data compression storage module. The functions of each module are as follows:
• Time-series Data Collection Module: In this module, the UAV collects data by reading the data sensed by sensors deployed in the air. Once the UAV has read the data, the sensor deletes it directly. The output of this module is the input of the next module.

Problem statement
In this section, we formally define the problems related to UAV data collection as follows:
(1) Dominant Dataset Selection Problem: We assume that, for the dataset $X_{m \times n} = (x_1, x_2, \ldots, x_n)_m$ collected by the sensor devices, there exist a transformation function f and an (ε, δ)-solver such that the affine transformation relationship between the dominant dataset S′ and the dataset S is established, that is, $S' = P \times f$ and $|X - S'| \leqslant \varepsilon$. The transformation function f is obtained from this relationship.
(2) Minimize the Dominant Dataset Size Optimization Problem: Let the information loss be $|S - S'|$. If it is less than ε, the goal of the optimization is to minimize the size of the dominant dataset, $\min |P|$, subject to $\Pr(|S - S'| > \varepsilon) \leqslant \delta$, where $\Pr(\cdot)$ is a probability function. According to Equation (11), UAVs can store the smallest dominant dataset and guarantee a high restoration accuracy.

Design goals
In this article, our goal is to develop an efficient data collection framework for UAVs. Like most data collection frameworks, the proposed framework enables efficient data collection for UAVs. First, the proposed framework needs to be able to pre-process the collected data. Furthermore, the framework uses clustering methods to find pairs of strongly correlated data. Second, the proposed framework can significantly improve the efficiency of data collection using affine transformation. Third, the time and cost of the proposed framework must not be higher than those of the traditional framework.

FIGURE 2 The overview of the time-series data collection module

Time-series data collection module
As shown in Figure 2, the time-series data collection module consists of three processes: data preprocessing, data clustering, and data outputting. First, the sensors collect and store sensing data at regular intervals and preprocess it. Second, the sensors cluster the time-series data. Specifically, we use the K-means clustering method to cluster TSD [41,42]. Third, the output of this module is used as the input of the next module. UAVs travel to the sensors to read the data they have collected.

Data preprocessing
The sensor will remove missing values and outliers from the collected data. Furthermore, the sensor will store the data in time series.
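A minimal sketch of this preprocessing step for the readings of one sensor is given below. The text does not say how outliers are detected, so the interquartile-range rule, the factor `k=1.5`, and the function name `preprocess` are all assumptions made for illustration.

```python
import numpy as np

def preprocess(readings, k=1.5):
    """Drop missing values, then drop outliers outside the IQR fences."""
    x = np.asarray(readings, dtype=float)
    x = x[~np.isnan(x)]                      # remove missing values (NaN)
    if x.size == 0:
        return x
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    # Assumed outlier rule: keep values within [q1 - k*iqr, q3 + k*iqr].
    keep = (x >= q1 - k * iqr) & (x <= q3 + k * iqr)
    return x[keep]                           # remaining values stay in time order
```

Boolean masking keeps the surviving readings in their original order, so the result can still be stored as a time series.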

Data clustering
To make it easier for UAVs to extract the dominant dataset, sensors need to cluster the collected data. Specifically, sensors cluster time-series data and select the data centre of each cluster. The reason is that the clustering process divides data with high similarity into multiple clusters, and the data pairs in each cluster have a strong correlation. We use the K-means clustering algorithm, which can effectively cluster the collected data. The formal definitions related to the proposed algorithm are as follows:

Definition 7. Euclidean Distance: There are two points $X_i$ and $X_j$ in the Euclidean space, and their linear distance in the m-dimensional space is defined as follows:
$$d(X_i, X_j) = \sqrt{\sum_{t=1}^{m} (x_{it} - x_{jt})^2}$$
where $x_{it}$ denotes the t-th component of $X_i$.

Definition 8. Sum of Squares for Error (SSE):
Let the average value of the samples in the i-th of the k clusters be $\bar{X}_i$. The error between each sample and $\bar{X}_i$ indicates the degree of clustering, and its formula is as follows:
$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{X \in S_i} \lVert X - \bar{X}_i \rVert^2$$
where X represents a sample point in the i-th cluster $S_i$, and $\bar{X}_i$ represents the average value of the samples in the i-th cluster $S_i$, as follows:
$$\bar{X}_i = \frac{1}{|S_i|} \sum_{X \in S_i} X$$

Data outputting
The sensor stores the clustered data separately in clusters and outputs it as the input of the next module. This is very important for UAVs to efficiently extract the dominant dataset. This is because the data in each cluster has a strong correlation.
The steps of this algorithm are described as follows:
• Step 1, Initialization: The proposed algorithm randomly extracts k initial clustering centres from the time-series dataset.
• Step 2, Calculation: According to Definition (7), the distance between each remaining sample point and each cluster centre is calculated, and each sample point is assigned to its closest cluster centre.
• Step 3, Clustering: According to Definition (8), recalculate the average value $\bar{X}_i$ of all sample points in each cluster, where $\bar{X}_i$ is the new cluster centre.
• Step 4, Iteration: When the SSE converges to a certain threshold or the number of iterations reaches a preset value, the clustering centres no longer change and the algorithm terminates. Otherwise, repeat Steps 2 and 3.
The K-means clustering algorithm is thus presented in Algorithm 1.
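The steps above can be sketched in a few lines of NumPy. This is a plain textbook K-means under stated assumptions, not a transcription of Algorithm 1: the function name, the stopping tolerance `tol`, the fixed `seed`, and the handling of empty clusters are choices made for illustration.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Cluster the rows of X into k groups; returns (centres, labels, sse)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct samples as initial centres.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    prev_sse = np.inf
    for _ in range(max_iter):
        # Calculation: Euclidean distance of every sample to every centre,
        # then assign each sample to its closest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Clustering: move each centre to the mean of its cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members):          # keep the old centre if a cluster is empty
                centres[j] = members.mean(axis=0)
        # Iteration: stop once the SSE no longer improves noticeably.
        sse = float(((X - centres[labels]) ** 2).sum())
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return centres, labels, sse
```

In the setting of this module, each row of `X` would be the time series of one sensor, so that strongly correlated series end up in the same cluster.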

Affine transformation module
In this module, we show how UAVs use affine transformation to extract the dominant dataset of TSD, thereby improving the efficiency of UAVs in collecting datasets and reducing the capacity required for dataset storage.
According to Equations (15) and (16), we can obtain the dominant dataset P. The above derivation holds when the dataset is two-dimensional. Next, the above equations are generalized to N dimensions.

Dynamic update mechanism
When the data collected by the sensor is updated, UAVs need to update the dominant dataset P. To update the dominant dataset dynamically, we propose a dynamic update mechanism for UAVs. We assume that the sensor updates the time-series data at time interval t, that is, $P_{m+1} = (x_1, x_2, \ldots, x_n, 1)_{m+1}$ and $S_{m+1} = (x'_1, x'_2, \ldots, x'_n)_{m+1}$. The transformation function $f(m+1)$ is then solved from $P_{m+1}$ and $S_{m+1}$, and UAVs can use Equation (19) to update the function f.

First, as shown in Figure 3, draw a circle with radius r and calculate the mean shift of $C_0$. Then move $C_0$ to the position of the shift centre and draw a circle with $C_0$ as the centre. Finally, calculate the mean of the next round of shift and move the centre of the shift until the iteration termination condition is met.
Second, the steps of the proposed algorithm are as follows:
• Initialization: Input the time-series dataset X and a set of random cluster centres. We use Algorithm (1) to cluster the input dataset X.
• Selection: Given the (ε, δ)-solver, the dominant dataset selection algorithm based on mean shift is used to extract the dominant dataset according to the above definition.
• Output: When step 2 is satisfied, UAVs store the result of step 2 as the collected dataset. Finally, UAVs transfer the dataset to the DC, and the DC uses the transformation function f to restore the original dataset. If the data collected by the sensor is updated, UAVs use the dynamic update mechanism to update the function f.
The mean shift-based dominant dataset selection algorithm is thus presented in Algorithm 2. Furthermore, the insights of this algorithm are summarized as follows:
• We use mean shift to set the cluster centre in each cluster so as to accurately find the dominant dataset P, because in the mean shift method the average distance between the cluster centre and each point in the cluster is shorter than in other methods.
• Since the algorithm complexity cannot be too high, we use the lightweight mean shift method to find the cluster centre, because in mean shift we only need to calculate the distance between each point and the cluster centre once.
• The mean shift method meets the definition of the solver we designed, which helps us find the dominant dataset.
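The dynamic update of f mentioned in the Output step can be sketched as an incremental least-squares fit. This is one possible way to re-solve f when a new observation time arrives and should not be read as Equation (19): the class name, the accumulated normal equations, and the small `ridge` term that keeps the system solvable before enough rows have arrived are all assumptions.

```python
import numpy as np

class IncrementalAffineFit:
    """Keeps the normal equations of S ≈ [P, 1] @ f and updates them row by row."""

    def __init__(self, n_inputs, n_outputs, ridge=1e-9):
        d = n_inputs + 1                       # +1 for the constant column
        self.G = ridge * np.eye(d)             # accumulates [P, 1].T @ [P, 1]
        self.H = np.zeros((d, n_outputs))      # accumulates [P, 1].T @ S

    def update(self, p_row, s_row):
        """Add one new observation time (one row of P and of S) and re-solve f."""
        p1 = np.append(np.asarray(p_row, dtype=float), 1.0)
        s = np.asarray(s_row, dtype=float)
        self.G += np.outer(p1, p1)
        self.H += np.outer(p1, s)
        return np.linalg.solve(self.G, self.H)
```

Each update costs a rank-one accumulation plus one small linear solve, so the UAV does not need to keep the earlier rows of P and S in storage.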
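The mean shift iteration described with Figure 3 can be sketched as follows, assuming a flat circular window of radius r. Picking the sample closest to the converged centre as the central object of a cluster is an illustrative assumption, and `pick_central_object` is a hypothetical helper rather than part of Algorithm 2.

```python
import numpy as np

def mean_shift_centre(points, start, radius, max_iter=100, tol=1e-6):
    """Shift `start` to the mean of the points within `radius` until it settles."""
    points = np.asarray(points, dtype=float)
    centre = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Points inside the circle of the given radius around the current centre.
        inside = points[np.linalg.norm(points - centre, axis=1) <= radius]
        if len(inside) == 0:          # nothing in the window: stay put
            break
        new_centre = inside.mean(axis=0)
        if np.linalg.norm(new_centre - centre) < tol:
            centre = new_centre
            break
        centre = new_centre
    return centre

def pick_central_object(points, start, radius):
    """Index of the sample closest to the converged centre (hypothetical helper)."""
    pts = np.asarray(points, dtype=float)
    centre = mean_shift_centre(pts, start, radius)
    return int(np.argmin(np.linalg.norm(pts - centre, axis=1)))
```

Applied once per K-means cluster, the returned index would identify the series kept in the dominant dataset, while the remaining series of the cluster are reconstructed from it.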

EXPERIMENTAL RESULTS AND ANALYSIS
In this experiment, we analyze the effectiveness and efficiency of UAVs using the proposed dominant dataset selection method. Given the (ε, δ)-solver, the performance of the dominant dataset selection algorithm based on mean shift is evaluated with the affine transformation model. Furthermore, the reconstruction accuracy of the original dataset is also analyzed in the experiment.

Experimental setup
In this experiment, we use the Python programming language on the Anaconda Navigator platform to implement the proposed algorithm [44]. The experiments were run in a virtual environment on a PC with an Intel i7-6770 CPU and 16 GB RAM running Ubuntu 18.0. The experimental dataset is a real precipitation dataset from Harbin, China, for July 2019. Specifically, the sensors collect precipitation data at each station every 30 min. The dataset contains 31 days of precipitation data from 100 monitoring stations.
In this section, we focus on the time efficiency of UAVs collecting data and the consumption of UAVs storage capacity. In addition, the accuracy of the dataset reconstructed by DC through the transformation function f is also one of the indicators we focus on.

Performance analysis of the proposed algorithms
In this section, we first investigate the performance of the proposed algorithm. We compare the affine transformation (AF) algorithm with the scheme based on least squares transformation (LST), the scheme based on singular value decomposition (SVD) and the scheme based on linear regression transformation (LRT). Note that we fixed the size of dataset as 100 MB.
First, we illustrate the efficiency of the proposed algorithm by comparing the time cost of different schemes. As shown in Figure 4, the time cost of the proposed algorithm is about 50% of the worst-performing scheme. The reason is that the affine transformation originates from the spatial transformation in the image transformation, which can realize the linear transformation of high-dimensional data. However, other schemes are applicable to low-dimensional data but not to high-dimensional data. The data collected by sensors is often high-dimensional and complex. Therefore, the proposed algorithm outperforms other solutions in terms of time cost.
Second, we illustrate the compression performance of the proposed algorithm by comparing the size of the dominant dataset extracted by different schemes. From the experimental results, the proposed algorithm can extract a dominant dataset of 25.6 MB from the original dataset of 100 MB. As shown in Figure 5, LST-based schemes, SVD-based schemes and LRT-based schemes extract dominant datasets with sizes of 62.1, 58.4, and 61.4 MB, respectively. In the proposed algorithm, we use a mean shift-based dominant dataset selection method to efficiently extract the dominant dataset. This is because mean shift can aggregate data pairs with similar features, which reduces the amount of information needed to extract the dominant dataset. Other schemes do not use clustering algorithms, so they need to extract more information to ensure the accuracy of the restoration from the dominant dataset.
Third, we compare the accuracy of the four schemes in reconstructing the original dataset from the dominant dataset, because the reconstruction accuracy is a very important indicator for evaluating the algorithm. Figure 6 and Figure 7 show the experimental results. From the experimental results, the accuracy of the proposed algorithm is higher, which is 15.8% higher than the […]
Fourth, we comprehensively analyze the algorithmic time complexity of the proposed method and the benchmark methods. The algorithmic complexity of all methods consists of the running time cost of each statement. For the variable initialization phase, the time complexity of all methods is $O(m \times n)$. For the dominant dataset selection phase, the time complexity of the AF and LST methods is $O(n^3)$, and the complexity of the other methods is $O(n^4)$. The reason is that the other methods need to perform complex matrix factorization, which is computationally very expensive. Therefore, the SVD and LRT methods have greater time complexity than the AF and LST methods. In addition, the LST method has less information extraction error than the AF method. In short, the AF method has great advantages in accuracy and time complexity.
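As a worked illustration of these sizes, the share of storage saved by each scheme relative to the 100 MB original follows directly from the reported numbers. The variable names are chosen for illustration only.

```python
original_mb = 100.0
# Dominant dataset sizes reported above, in MB.
dominant_mb = {"AF": 25.6, "LST": 62.1, "SVD": 58.4, "LRT": 61.4}

# Percentage of storage saved relative to the original dataset.
saving = {name: round(100.0 * (1 - size / original_mb), 1)
          for name, size in dominant_mb.items()}
```

The entry for the proposed AF scheme corresponds to storing roughly a quarter of the original data.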
The above experiments confirmed the performance of our proposed algorithm. Therefore, UAVs can use the proposed algorithm to efficiently collect data.

Effects of parameter ε on the dominant dataset selection algorithm
We further analyze the effects of the parameter ε on the dominant dataset selection. In this experiment, we set ε ∈ {1%, 3%, 5%, 8%, 10%} and fix the parameter δ. We explore the effect of ε on the performance of the algorithm so that its value can be set on a sound basis. We take reconstruction accuracy as the evaluation index. The experimental results show that the proposed algorithm has the best performance when ε = 5%. If ε is too large, the reconstruction accuracy is relatively low, because ε controls the accuracy of the extraction of the dominant dataset. When ε is relatively small, the condition for extracting the dominant dataset is very strict, so the amount of information extracted is not enough and the reconstruction accuracy is not high. Therefore, we set ε to about 5%.

CONCLUSION
To enable UAVs to collect data efficiently, we propose a dominant dataset selection algorithm based on affine transformation. Specifically, UAVs improve the efficiency of data collection by storing the smaller dominant dataset instead of the original dataset. After the UAVs dump the dominant dataset to the DC, the original dataset is restored by the DC. Experimental results show that the proposed algorithm has low time cost and low storage cost. For example, Figure 5 shows that the accuracy of the proposed algorithm reaches 91.2%. The performance of the proposed algorithm is mainly due to the use of the mean shift method to improve the similarity between data pairs. In the future, we will apply the federated learning [19, 20, 45-48] framework to UAVs to collect data collaboratively. Currently, the UAV-based data collection framework does not consider user privacy, so this is a promising direction.