Data privacy‐based coordinated placement method of workflows and data

With the rapid development of data acquisition technology, data in many industries already exhibit the characteristics of big data, and cloud technology provides strong support for storing and performing complex calculations on these massive data. The meteorological department has established a cloud data centre based on its existing storage and computing resources and re-arranged the historical data to reduce the historical data access time of applications. However, the placement of each workflow and its input data also affects the average data access time, which in turn affects the computing efficiency of the cloud data centre. At the same time, because multiple nodes process data collaboratively, the resource utilisation of the cloud data centre is receiving more and more attention. In addition, as data security requirements increase, data with privacy conflicts should avoid being placed on the same or neighbouring nodes. In response to this challenge, based on the fat-tree network topology, this study proposes a data privacy protection-based collaborative placement strategy for workflows and data that jointly optimises the average data access time, the average resource utilisation and the data conflict degree. Finally, extensive experimental evaluations and comparative analyses verify the efficiency of the proposed method.


Background
With the advancement of data acquisition technology and the improvement of industry service requirements [1][2][3], the volume and variety of industry data continue to grow, and data in many industries, such as traffic data, financial data and meteorological data, already exhibit the characteristics of big data [4,5]. Based on these massive industry data, the computational complexity of various industrial applications is also increasing [5], so cloud technology provides strong support for storing these data and executing such complex calculations. Building on their existing storage and computing resources, many departments have established cloud data centres and offloaded a large number of industrial applications and data to them for execution and storage [6,7]. The powerful computing performance of the cloud centre helps improve the efficiency of various applications, but data transmission between the nodes of the cloud centre also incurs additional data access time, and meteorological data are a particularly acute case. In both the temporal and the spatial dimension, meteorological data are high-density big data. As historical data are the most important data source for all meteorological workflows, these workflows usually need to fetch large amounts of historical data from multiple storage nodes, so they depend on massive historical data far more heavily than workflows in other industries. Therefore, in order to reduce the average data access time of all meteorological applications, the meteorological department has analysed the characteristics of the massive historical data and rationally distributed them across the storage nodes [8,9]. However, the data access time of each application depends not only on the placement of historical data but also on the placement of each workflow and its input data.
Based on the overall placement of meteorological historical data, we continue to study how to collaboratively place the input data and tasks of workflows on the corresponding nodes of the cloud centre [8,10], thereby further reducing the average data access time of all tasks in the workflows [11,12]. As the number of meteorological workflows and data offloaded to the cloud centre increases rapidly [13,14], increasing the resource utilisation of active nodes in the cloud centre is also receiving more and more attention [15,16] and has become an important indicator for measuring the performance of a placement method [17,18]. In addition, as the confidentiality requirements of meteorological data increase, some meteorological data conflict with each other in terms of privacy, so these conflicting data should avoid being placed on the same or neighbouring storage nodes to ensure their security [19][20][21]. Therefore, while improving the resource utilisation of nodes, the placement of conflicting data is also receiving increasing attention. All in all, the privacy-aware collaborative placement of workflows and data has become a challenge. In response, this paper proposes a data privacy-based optimisation method for the collaborative placement of workflows and data.

Paper contributions
In this paper, we have made the following contributions:
• Based on the fat-tree network topology, each meteorological application is modelled as a workflow, i.e. a set of sequential tasks, and each task in a workflow depends on certain historical data or input data.
• Based on the placement of the historical data and the conflict relationships between private data, the coordinated placement problem of meteorological workflows and data is modelled as a constrained multi-objective optimisation problem. The three optimisation goals of this problem, the data access time indicator, the resource utilisation indicator and the data privacy conflict indicator, are all modelled and formulated.
• For this multi-objective optimisation problem, we propose a coordinated placement optimisation method based on the non-dominated sorting differential evolution (NSDE) algorithm and adopt the simple additive weighting (SAW) method and the multiple criteria decision making (MCDM) method to determine the final optimal coordinated placement solution.
• The coordinated placement method currently most commonly used in the meteorological department and a recently proposed optimisation method of the same type as the NSDE algorithm are selected for performance comparison.
The remainder of this paper is organised as follows. Section 2 mainly analyses the meteorological fat-tree network structure and a typical meteorological workflow. The coordinated placement problem is modelled as a constrained multi-objective optimisation problem in Section 3 and is solved based on the NSDE algorithm in Section 4. Section 5 compares and analyses the experimental results. Section 6 summarises the related work. Finally, we outline the conclusions and future work in Section 7.

Meteorological fat-tree network
The meteorological network usually adopts a tree structure, but in the traditional tree network the bandwidth converges layer by layer, so network congestion is likely to occur. Therefore, based on the traditional tree structure, the fat-tree network structure has been proposed and widely adopted by the meteorological department. The fat-tree network is divided into three layers from top to bottom: the core layer, the aggregation layer and the edge layer. The aggregation layer switches and the edge layer switches form a pod. The bandwidth of the fat-tree topology is non-convergent, so it can provide high-throughput transmission for the meteorological data centre and ensure that the network is non-blocking. In addition, there are multiple parallel paths between any two nodes, so the fault tolerance of the network is good. In actual meteorological applications, a public meteorological cloud data centre is constructed based on virtualisation technology and the fat-tree network topology. The switches of each department constitute a pod, or the switches of several adjacent departments constitute one pod, and each pod connects to the servers of the department(s) to which it belongs. According to the fat-tree rules, if the meteorological cloud data centre contains N_pod pods, each pod contains N_pod/2 edge switches and N_pod/2 aggregation switches and can connect (N_pod/2)^2 servers, and the number of core switches is also (N_pod/2)^2. Fig. 1 shows a meteorological fat-tree network topology with four pods. In practical applications, the network of the meteorological department is usually much larger than this.
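The pod arithmetic above can be checked with a small helper (a sketch; the function and key names are ours, not from the paper):

```python
def fat_tree_counts(n_pod: int) -> dict:
    """Switch and server counts for a fat-tree with n_pod pods.

    Follows the rules stated in the text: each pod holds n_pod/2 edge
    switches and n_pod/2 aggregation switches, each pod can connect
    (n_pod/2)^2 servers, and there are (n_pod/2)^2 core switches.
    """
    half = n_pod // 2
    return {
        "edge_per_pod": half,
        "agg_per_pod": half,
        "servers_per_pod": half * half,
        "core_switches": half * half,
        "total_servers": n_pod * half * half,
    }
```

For the four-pod topology of Fig. 1 this gives 2 edge and 2 aggregation switches per pod, 4 servers per pod, 4 core switches and 16 servers in total.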

Meteorological scene and workflow description
In the current meteorological big data cloud processing mode, in order to improve the service efficiency of the massive meteorological historical data and reduce the average data access time, the meteorological department analyses the characteristics of the historical data, which are the most important data source for various meteorological applications, and reasonably stores them on certain fixed storage nodes. In addition, user input data, which can be placed dynamically during an application's execution, are also an important data source for each application. Therefore, given the already placed meteorological historical data, the workflows and their input data are placed in a coordinated manner, so that the average data access time and the data conflict degree are minimised and the average resource utilisation of all active nodes is maximised.
Based on workflow technology, each meteorological application can be modelled as a meteorological workflow and operations in the application can be modelled as a set of tasks in the workflow. Fig. 2 shows the workflow of weather forecast production.
As the starting task, t_0 represents the data collection operation, covering automatic station data, radar data, satellite nephograms and so on. Tasks t_1 and t_2 represent the historical weather summary and historical weather analysis operations, which summarise the historical weather phenomena and analyse the causes of weather formation over the past 48 h, respectively. Tasks t_3 and t_4 represent the real-time weather summary and real-time weather analysis operations, which summarise the current weather phenomena and analyse the causes of current weather formation, respectively. Task t_5 represents the forecast mode calculation operation: the future weather is calculated in real time based on the European Centre for Medium-Range Weather Forecasts (ECMWF) model and the global forecasting system. Tasks t_6 and t_7 represent the weather situation analysis and meteorological elements analysis operations, which analyse the future weather situation and the future meteorological elements based on the results of the forecast model, respectively. Task t_8 represents the generation of the forecast model conclusion: based on the analyses of the weather situation and meteorological elements, the final conclusion of the forecast model is formed. As the terminating task, t_9 represents the generation of the weather forecast conclusion: based on the analyses of historical and real-time weather, combined with the conclusion of the forecast model, the final weather forecast conclusion is formed.
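The task dependencies described above can be sketched as a small predecessor map. The edge set below is inferred from the prose (Fig. 2 remains authoritative), so treat the exact structure as an assumption:

```python
# Predecessor map for the weather-forecast workflow, as we read the text:
# t0 feeds every first-stage task; t6/t7 consume t5's forecast results;
# t8 joins t6 and t7; t9 joins the historical, real-time and model branches.
PRED = {
    "t0": [],
    "t1": ["t0"], "t2": ["t0"],           # historical summary / analysis
    "t3": ["t0"], "t4": ["t0"],           # real-time summary / analysis
    "t5": ["t0"],                         # forecast mode calculation
    "t6": ["t5"], "t7": ["t5"],           # situation / elements analysis
    "t8": ["t6", "t7"],                   # forecast model conclusion
    "t9": ["t1", "t2", "t3", "t4", "t8"]  # final forecast conclusion
}

def topo_order(pred):
    """Kahn-style topological order over the predecessor map (assumes a DAG)."""
    order, done = [], set()
    while len(order) < len(pred):
        ready = [t for t in pred
                 if t not in done and all(p in done for p in pred[t])]
        order.extend(sorted(ready))
        done.update(ready)
    return order
```

A valid execution order must start at t_0 and end at t_9, which the map reproduces.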

Problem modelling and formulation
In this section, based on the fat-tree network topology, we model this data privacy-based coordinated placement problem as a constrained multi-objective optimisation problem and formulate its three optimisation goals: the average data access time, the average resource utilisation and the data privacy conflict degree.

Problem modelling
Assume that a meteorological workflow consists of M tasks, defined as TS = {t_0, t_1, t_2, …, t_{M−1}}. The data sources of meteorological workflows mainly include P input data and Q historical data, so the input data set and the historical data set can be defined as DS^in = {d_0^in, d_1^in, …, d_{P−1}^in} and DS^his = {d_0^his, d_1^his, …, d_{Q−1}^his}, respectively. Therefore, the relationship between the M tasks and the data can be expressed as Γ = {γ_0, γ_1, …, γ_{M−1}}, where γ_m represents the data set required by the mth task t_m. In addition, if there are K pairs of conflicting data, the conflict relationship between these data can be expressed as CD = {cd_0, cd_1, …, cd_{K−1}}, where cd_k = (d_x, d_y) represents the kth pair of conflicting data; as conflicting data, d_x and d_y should be placed on two storage nodes that are far apart in the fat-tree network to ensure their security.

Data access time model
In the meteorological fat-tree network, it is assumed that task t_m and its required data d are placed on the compute node u_i and the storage node u_j, respectively, and that the data amount is d_n. The positional relationship between u_i and u_j can be defined as δ_{i,j}, then:
• If u_i and u_j are the same node, then δ_{i,j} = 0.
• If u_i and u_j belong to the same edge switch of the same pod, then δ_{i,j} = 1.
• If u_i and u_j belong to different edge switches of the same pod, then δ_{i,j} = 2.
• If u_i and u_j belong to different pods, then δ_{i,j} = 3.
Therefore, according to the positional relationship δ_{i,j} between u_i and u_j, the access time T_m^AC of task t_m for the data d_n can be expressed as

T_m^AC = 0, if δ_{i,j} = 0
T_m^AC = 2·d_n/B_se, if δ_{i,j} = 1
T_m^AC = 2·d_n/B_se + 2·d_n/B_ea, if δ_{i,j} = 2
T_m^AC = 2·d_n/B_se + 2·d_n/B_ea + 2·d_n/B_ac, if δ_{i,j} = 3    (1)

where B_se, B_ea and B_ac represent the bandwidth between a server and an edge layer switch, between an edge layer switch and an aggregation layer switch, and between an aggregation layer switch and a core switch, respectively, and d_n represents the amount of data to be transferred. Therefore, the total data access time of task t_m for its required data set γ_m can be calculated as

T_m = Σ_{d_n ∈ γ_m} T_m^AC(d_n)    (2)

Then, the average data access time of the M tasks can be calculated as

T_avg = (1/M) · Σ_{m=0}^{M−1} T_m    (3)
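The access-time rule can be sketched as a piecewise function of δ. This is our reading of the model, assuming each network layer crossed on the path contributes two link traversals of the transferred amount:

```python
def access_time(delta: int, d_n: float,
                b_se: float, b_ea: float, b_ac: float) -> float:
    """Access time of one task for one data item of size d_n, given the
    positional relationship delta between compute node and storage node.

    Assumption: each layer crossed adds 2 * d_n / bandwidth (up and down).
    """
    if delta == 0:           # same node: no transfer at all
        return 0.0
    t = 2 * d_n / b_se       # server <-> edge-switch links
    if delta >= 2:
        t += 2 * d_n / b_ea  # edge <-> aggregation links
    if delta == 3:
        t += 2 * d_n / b_ac  # aggregation <-> core links
    return t
```

Summing `access_time` over a task's required data set γ_m and averaging over the M tasks yields the average data access time objective.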

Resource utilisation model
Assume that the numbers of compute nodes and storage nodes are N_col and N_sto, respectively, and that tvm_m and dvm_n represent the resources required by the mth task and the nth data, respectively. The placement of the M tasks on the compute nodes can then be represented as a two-dimensional array CT[N_col, M], and the placement of the P + Q data on the storage nodes as a two-dimensional array SD[N_sto, P + Q], where

CT[i, m] = 1 if task t_m is placed on compute node u_i, and 0 otherwise    (4)
SD[j, n] = 1 if data d_n is placed on storage node u_j, and 0 otherwise    (5)

Then, the resource utilisation of the ith compute node and the jth storage node can be expressed as U_i^col and U_j^sto, respectively:

U_i^col = (Σ_{m=0}^{M−1} CT[i, m]·tvm_m) / R_i^col    (6)
U_j^sto = (Σ_{n=0}^{P+Q−1} SD[j, n]·dvm_n) / R_j^sto    (7)

where R_i^col and R_j^sto denote the resource capacities of the ith compute node and the jth storage node. If the nodes with non-zero resource utilisation are defined as active nodes, and the numbers of active compute nodes and active storage nodes are N_active^col and N_active^sto, respectively, then the average resource utilisation of the currently active compute nodes and storage nodes can be expressed as U_col and U_sto, respectively:

U_col = (1/N_active^col) · Σ_{U_i^col > 0} U_i^col    (8)
U_sto = (1/N_active^sto) · Σ_{U_j^sto > 0} U_j^sto    (9)

Finally, since the numbers of compute nodes and storage nodes may differ, based on the SAW method, the average resource utilisation U of all nodes is calculated as

U = v_0·U_col + v_1·U_sto    (10)

where v_0 + v_1 = 1: the larger the proportion of compute nodes among all nodes, the larger the corresponding weight v_0, and vice versa. Therefore, in this experiment, the two weights v_0 and v_1 are set as follows:

v_0 = N_col / (N_col + N_sto)    (11)
v_1 = N_sto / (N_col + N_sto)    (12)
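The utilisation objective can be sketched as follows, taking per-node utilisations as input and assuming the SAW weights are proportional to the node-type counts (which matches the rule that the larger group gets the larger weight):

```python
def average_utilisation(u_col, u_sto) -> float:
    """SAW-combined average utilisation of the active nodes.

    u_col / u_sto: per-node utilisation in [0, 1] for compute / storage
    nodes. Nodes with zero utilisation are inactive and excluded from
    the per-type averages, but still count towards the SAW weights.
    """
    active_col = [u for u in u_col if u > 0]
    active_sto = [u for u in u_sto if u > 0]
    avg_col = sum(active_col) / len(active_col) if active_col else 0.0
    avg_sto = sum(active_sto) / len(active_sto) if active_sto else 0.0
    n_col, n_sto = len(u_col), len(u_sto)
    v0 = n_col / (n_col + n_sto)   # weight grows with the compute share
    v1 = n_sto / (n_col + n_sto)
    return v0 * avg_col + v1 * avg_sto
```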

Data conflict model
The closer two privacy-conflicting data are placed in the network, the greater the possibility of a privacy breach. Therefore, in order to ensure data privacy, conflicting data should be prevented from being placed on the same or neighbouring nodes. For the current data placement, assume that there are N_SN pairs of conflicting data placed on the same node, N_SS pairs placed on different nodes under the same edge layer switch and N_SP pairs placed under different edge layer switches of the same pod. We set corresponding weights w_0, w_1 and w_2 for these three placement situations of conflicting data.
Then, the data conflict degree C over the K pairs of conflicting data can be expressed as

C = (w_0·N_SN + w_1·N_SS + w_2·N_SP) / K    (13)

where w_0 + w_1 + w_2 = 1, and the closer the conflicting data are placed, the larger the corresponding weight. Therefore, in this experiment, the three weights w_0, w_1 and w_2 are set to 0.55, 0.3 and 0.15, respectively.
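The conflict-degree indicator can be sketched as a weighted count of badly placed pairs. Normalising by the number of pairs K is our assumption; it keeps C in [0, 1]:

```python
def conflict_degree(n_sn: int, n_ss: int, n_sp: int, k: int,
                    w0: float = 0.55, w1: float = 0.3,
                    w2: float = 0.15) -> float:
    """Data conflict degree C for K pairs of conflicting data.

    n_sn / n_ss / n_sp count conflicting pairs on the same node, under
    the same edge switch and within the same pod, respectively; the
    closer placement gets the larger weight (w0 > w1 > w2).
    """
    return (w0 * n_sn + w1 * n_ss + w2 * n_sp) / k
```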

Problem formulation and constraint
In this paper, the coordinated placement of meteorological workflows and data with privacy conflict protection has been modelled as a multi-objective optimisation problem. The optimisation goals are to minimise the average data access time calculated in (3), to minimise the data conflict degree calculated in (13) and to maximise the average resource utilisation of active nodes calculated in (10). Therefore, the optimisation model can be expressed as

min T_avg, max U, min C    (14)

In addition, since the resources of each compute node or storage node are limited, the resources used on each node cannot exceed its capacity. This coordinated placement problem is therefore a constrained multi-objective optimisation problem, and the constraint can be expressed as

U_i^col ≤ 1 for all i, U_j^sto ≤ 1 for all j    (15)

which states that the resource utilisation of any compute node or storage node cannot exceed 100%. Finally, the symbols used in this work are summarised uniformly in the Nomenclature.
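The capacity constraint amounts to a simple feasibility check on a candidate placement; a minimal sketch:

```python
def feasible(u_col, u_sto) -> bool:
    """Constraint check: the resource utilisation of every compute node
    and every storage node must not exceed 100%."""
    return all(u <= 1.0 for u in u_col) and all(u <= 1.0 for u in u_sto)
```

Infeasible individuals produced during the evolutionary search would be rejected or repaired before evaluation.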

Problem optimisation
In Section 3, the coordinated placement of meteorological workflows and data was modelled as a constrained multi-objective optimisation problem. In this section, the coordinated placement method based on the NSDE algorithm (CPDE) is proposed.
First, we encode the optimisation problem and generate an initial population as the first generation parental population. Then, based on the parental population, the mutation, crossover and selection operations are performed repeatedly: mutation and crossover continually search the solution space for better solutions, while in the selection phase fast non-dominated sorting and crowding distance calculation are used to retain the individuals with relatively good objective values into the next generation. Finally, the utility values of the resulting excellent individuals are compared based on the SAW method and the MCDM method, and the individual with the best utility value is output as the final optimal coordinated placement solution.

Encoding
In the encoding phase, the placement strategies of all tasks in the workflows and of the input data are encoded as real numbers; each real number indirectly represents the location at which the corresponding task or data is placed. Therefore, the placement strategies of the M tasks and P input data can be encoded as a set X = {X^T, X^OD}, where X^T = {x_0^T, x_1^T, …, x_{M−1}^T} represents the placement strategies of the M tasks, whose placement positions must be compute nodes, and X^OD = {x_0^OD, x_1^OD, …, x_{P−1}^OD} represents the placement strategies of the P user input data, whose placement positions must be storage nodes. In addition, the placement positions X^HD of the Q historical data are fixed.
Assume that the IDs of the N_col compute nodes and the N_sto storage nodes are encoded as [0, 1, 2, …, A − 1] and [A, A + 1, A + 2, …, A + B − 1], respectively, so the range of all genes can be set to [0, A + B). However, since all tasks and input data must be placed on compute nodes and storage nodes, respectively, these genes cannot fully represent the placement positions of the corresponding tasks and data. To facilitate the calculation of the NSDE algorithm, each gene of an individual must be converted into the ID of a compute node or storage node: the genes representing task placements are converted into compute node IDs in [0, A), and the genes representing input data placements are converted into storage node IDs in [A, A + B). Therefore, the conversion relationship between the gene x_{i,j} and the placement position p_{i,j} is as follows:

p_{i,j} = floor(x_{i,j}) mod A, for 0 ≤ j < M
p_{i,j} = A + (floor(x_{i,j}) mod B), for M ≤ j < M + P    (16)

where the function floor(x) rounds x down to the nearest integer. Finally, the individual x_i = {x_{i,0}, x_{i,1}, …, x_{i,M+P−1}} can be converted into a vector p_i = {p_{i,0}, p_{i,1}, …, p_{i,M+P−1}} that directly represents the placement node ID of each task and data. Table 1 illustrates the individual encoding and its conversion through a simple example, where x_i is the encoding of all genes and p_i contains the placement node IDs of the corresponding tasks and data after conversion by (16).
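A gene-to-node-ID conversion consistent with the stated ranges can be sketched as below. The modulo mapping is our assumption of the conversion rule; any surjective map from [0, A + B) onto the valid ID range would serve the same purpose:

```python
import math

def gene_to_node(x: float, j: int, m: int, a: int, b: int) -> int:
    """Map a real-valued gene x in [0, A + B) to a node ID.

    Task genes (j < m) map to compute-node IDs in [0, A); input-data
    genes map to storage-node IDs in [A, A + B).
    """
    g = math.floor(x)
    if j < m:                 # gene encodes a task placement
        return g % a
    return a + (g % b)        # gene encodes an input-data placement
```

For example, with A = 5 compute nodes and B = 4 storage nodes, the gene value 7.9 maps to compute node 2 for a task position and to storage node 8 for an input-data position.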

Objective functions
In this constrained multi-objective optimisation problem, there are three objective functions: the average data access time, the average resource utilisation and the data conflict degree, given by (3), (10) and (13), respectively. We need to find a solution that achieves a balance between the three objective functions, so that all optimisation goals are relatively good rather than only one or two of them. The calculation of the average data access time and the average resource utilisation is illustrated in Algorithms 1 and 2 (see Figs. 3 and 4). The NSDE algorithm then optimises the population and finally obtains the best placement strategy.
Algorithm 1 specifies the process of calculating the average data access time of all tasks. First, for each task t m in the task set TS (Line 1), the data access time of t m for each data d k it needs is calculated (Lines 2-4). Then, the total data access time of t m is also calculated (Line 5). Finally, the average data access time of all tasks is calculated (Line 8).
Algorithm 2 specifies the process of calculating the average resource utilisation of all active nodes, including the compute nodes and the storage nodes. First, the resource utilisation U i col of each compute node is calculated based on the placement location of all tasks (Lines 1-3). In addition, the resource utilisation U i sto of each storage node is calculated based on the placement location of all data (Lines 4-6). Then, the average resource utilisation of all the compute nodes and the storage nodes is calculated (Line 7), respectively. Finally, the average resource utilisation U of all nodes is calculated based on the number of compute nodes and storage nodes (Line 8).

Optimising problem using NSDE
As an efficient population-based multi-objective optimisation algorithm, NSDE is adopted to solve this multi-objective optimisation problem. First, we need to initialise a population as the first parental population.

Initialisation:
Assume that the size of the population is NP, so the initial population can be expressed as X = {X_0, X_1, …, X_i, …, X_{NP−1}}, where X_i is the ith individual of the population and represents a placement strategy for all tasks and input data; the size of an individual therefore depends on the total number of tasks and input data. If the optimisation problem has M tasks and P user input data, then X_i can be expressed as X_i = {x_{i,0}, x_{i,1}, …, x_{i,M+P−1}}, which represents the placement strategies of the M tasks and P user input data, and the placement strategy of each task or data is one gene of the individual X_i.
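The initialisation step can be sketched as drawing M + P uniform genes per individual over the valid gene range (names are ours):

```python
import random

def init_population(np_size: int, m: int, p: int, a: int, b: int):
    """Generate the first-generation parental population.

    Each of the np_size individuals has M + P real-valued genes drawn
    uniformly from [0, A + B), one gene per task or input-data placement.
    """
    return [[random.uniform(0, a + b) for _ in range(m + p)]
            for _ in range(np_size)]
```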
Based on the initial population as the first generation parental population, NSDE algorithm begins to perform mutation, crossover and selection operations recurrently.

Mutation:
In the mutation phase, three mutually different individuals X_a, X_b and X_c are first randomly selected from the population X. Then, based on the mutation factor F, the mutant individual H_i is generated by adding the scaled difference of two individuals to the third. Therefore, the mutation operation can be expressed as

H_i = X_a + F·(X_b − X_c)    (17)

Finally, based on (17), we generate the mutation population H = {H_0, H_1, …, H_i, …, H_{NP−1}}, whose size is also NP.
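This is the classic DE/rand/1 mutation; a minimal sketch over a population of real-valued gene lists (out-of-range genes would still need clipping or wrapping back into the valid gene range):

```python
import random

def mutate(pop, f: float):
    """DE/rand/1 mutation: H_i = X_a + F * (X_b - X_c).

    For each target index i, three mutually distinct individuals
    (also distinct from X_i) are sampled from the population.
    """
    np_size, dim = len(pop), len(pop[0])
    mutants = []
    for i in range(np_size):
        a, b, c = random.sample([j for j in range(np_size) if j != i], 3)
        mutants.append([pop[a][d] + f * (pop[b][d] - pop[c][d])
                        for d in range(dim)])
    return mutants
```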

Crossover:
In the crossover phase, according to the crossover factor CR, genes are selected from the mutation population H and the parental population X to form the crossover population R = {R_0, R_1, …, R_i, …, R_{NP−1}}, whose size is also NP. First, to ensure that at least one gene of the crossover individual R_i comes from the mutant individual H_i, a randomly chosen gene H_{i,j_rand} is copied to R_{i,j_rand}. Then, for each other gene of R_i, based on the crossover factor CR, a gene is chosen from either the parental individual X_i or the mutant individual H_i. Therefore, the crossover operation can be expressed as

R_{i,j} = H_{i,j}, if rand(0, 1) ≤ CR or j = j_rand
R_{i,j} = X_{i,j}, otherwise    (18)

Finally, based on (18), the crossover population R = {R_0, R_1, …, R_i, …, R_{NP−1}} is generated.
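This is standard binomial crossover; a per-individual sketch:

```python
import random

def crossover(parent, mutant, cr: float):
    """Binomial crossover: each trial gene comes from the mutant with
    probability CR, and gene j_rand is forced from the mutant so the
    trial differs from its parent in at least one position."""
    dim = len(parent)
    j_rand = random.randrange(dim)
    return [mutant[j] if (random.random() <= cr or j == j_rand) else parent[j]
            for j in range(dim)]
```

With CR = 1 the trial equals the mutant; with CR = 0 only the forced gene is taken from the mutant.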

Selection:
In the selection phase, the parental population X and the crossover population R are merged into a new population Y = {Y_0, Y_1, …, Y_i, …, Y_{2NP−1}}, whose size is 2NP. Based on the fast non-dominated sorting operation, all individuals in Y are divided into multiple dominance layers L_i (i = 0, 1, 2, …), where every individual in layer L_{i+1} is dominated by at least one individual in layer L_i. Therefore, individuals in layer L_i are more likely to be retained in the next generation population than individuals in layer L_{i+1}. Within the same dominance layer L_i, we compute the crowding distance of each individual, and the individuals with better crowding distances are preferentially retained in the next generation population.
Therefore, individuals in the better dominating layer and individuals with better crowding distances in the same dominating layer are preferentially retained to the next generation parental population X until the size of X is also NP.
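The layering step can be sketched with a plain O(n²) dominance check (the "fast" bookkeeping of domination counts is omitted for brevity; objectives are taken as minimised, so maximised ones would be negated first):

```python
def dominates(a, b) -> bool:
    """True if objective vector a Pareto-dominates b (all minimised)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def non_dominated_sort(objs):
    """Split objective vectors into dominance layers L0, L1, ...

    Each pass extracts the individuals not dominated by any other
    remaining individual; they form the next layer.
    """
    remaining = list(range(len(objs)))
    layers = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i])
                            for j in remaining if j != i)]
        layers.append(front)
        remaining = [i for i in remaining if i not in front]
    return layers
```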

Iteration:
In the selection phase of each generation, fast non-dominated sorting is used to generate that generation's Pareto-optimal solution set, from which the next generation parental population X is produced. NSDE then continues to perform mutation, crossover and selection on this parental population. Finally, multiple excellent individuals are obtained after the last generation of evolution.

Utility value comparison:
For the multiple Pareto solutions generated by the NSDE algorithm, we still need to perform a utility value comparison to choose one optimal solution as the final solution of this coordinated placement problem. However, these Pareto solutions do not dominate each other, so it is difficult to select a single individual directly. Therefore, the SAW method and the MCDM method are employed to evaluate these individuals and select the final optimal solution.
Let T_avg^i, U_i and C_i represent the three objective function values of the individual X_i, and let T_avg^min, T_avg^max, U_min, U_max, C_min and C_max represent the minimum and maximum of the three objective function values over all individuals. Then the utility values of the three objective functions can be expressed as follows:

v_i^t = (T_avg^max − T_avg^i) / (T_avg^max − T_avg^min)
v_i^u = (U_i − U_min) / (U_max − U_min)
v_i^c = (C_max − C_i) / (C_max − C_min)    (19)

According to the importance of these three utility values, the corresponding weights w_t, w_u and w_c are set, respectively: the higher the importance of an indicator, the greater its weight, and vice versa. Therefore, the aggregate utility value v_i of the individual X_i can be expressed as

v_i = w_t·v_i^t + w_u·v_i^u + w_c·v_i^c    (20)

where w_t + w_u + w_c = 1, and the larger v_i, the better the individual X_i.
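The SAW aggregation can be sketched as follows. The cost-type objectives (access time, conflict degree) are scaled as (max − x)/(max − min) and the benefit-type objective (utilisation) as (x − min)/(max − min); the equal default weights are placeholders, since the paper sets w_t, w_u and w_c by importance:

```python
def utility(t, u, c, t_rng, u_rng, c_rng, w=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Aggregate SAW utility of one Pareto solution.

    t_rng / u_rng / c_rng are (min, max) over all candidate solutions.
    Degenerate ranges (max == min) score 1.0, i.e. that objective
    cannot distinguish the candidates.
    """
    t_min, t_max = t_rng
    u_min, u_max = u_rng
    c_min, c_max = c_rng
    vt = (t_max - t) / (t_max - t_min) if t_max > t_min else 1.0
    vu = (u - u_min) / (u_max - u_min) if u_max > u_min else 1.0
    vc = (c_max - c) / (c_max - c_min) if c_max > c_min else 1.0
    wt, wu, wc = w
    return wt * vt + wu * vu + wc * vc
```

The Pareto solution with the largest aggregate utility is returned as the final placement.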

NSDE-based coordinated placement optimisation method:
The specific process of the proposed optimisation method is shown in Algorithm 3 (see Fig. 5). First, the initial population X_1 is generated and regarded as the first generation parental population (Line 2). Based on the parental population X_g, where g is the generation index, the mutation operation is performed and the mutation population H_g is generated (Line 4). The crossover operation then generates the crossover population R_g from the parental population X_g and the mutation population H_g (Line 5). Next, X_g and R_g are merged into a new population Y_g (Line 6), fast non-dominated sorting is performed on Y_g and all its individuals are divided into multiple dominance layers L_i (Line 7). Individuals in better dominance layers are preferentially retained in the next generation parental population X_{g+1}. Finally, we calculate the crowding distance of each individual within the same dominance layer L_i and keep the better individuals in X_{g+1} until its size reaches NP (Lines 9-13). Iteration continues until the termination condition is satisfied, and the set of excellent individuals S* is output (Line 16).

Comparison and analysis of experimental results
In this section, in order to evaluate the performance of our proposed CPDE method, we perform a series of experiments and compare the CPDE method with two other coordinated placement methods. First, we introduce the parameter settings and the three coordinated placement methods used in this experiment. Then, the performance of the three methods is compared and analysed.

Parameters setting and comparison methods
In this experiment, each compute virtual machine uses a custom second-generation Intel Xeon Scalable (Cascade Lake) processor with a sustained all-core turbo frequency of 3.6 GHz, while each storage virtual machine uses a high-frequency Intel Xeon E5-2686 v4 (Broadwell) processor with a base frequency of 2.3 GHz. Each compute node and storage node has two CPUs, and their memory sizes are set to 4 and 16 GB, respectively. Using the CloudSim simulation tool, the collaborative placement strategies are simulated on a desktop PC with an Intel Core i7-8700 3.20 GHz processor and 16 GB RAM.
To compare and evaluate the performance of the different methods, we optimise workloads of different sizes: the number of workflows is set to 5, 10, 15 and 20, respectively. For each data set, the number of tasks per workflow follows a normal distribution between 5 and 20, so the total number of simultaneously placed tasks ranges from roughly 50 to 250. Finally, the parameter settings are shown in Table 2 and the data set settings in Table 3, where N_Ws, TN_w and TN_total represent the number of workflows, the number of tasks in each workflow and the total number of tasks, respectively.
Besides our proposed CPDE method, we compare two other coordinated placement methods: the coordinated placement method based on the greedy algorithm (CPG), which is commonly used by the meteorological department, and the coordinated placement method based on the NSGA-II algorithm (CPGA), which was recently proposed by other researchers and is an optimisation method of the same type as our CPDE method. The three coordinated placement optimisation methods are briefly described as follows:
• CPDE: Our proposed method considers the average data access time, the average resource utilisation and the data conflict degree comprehensively, and optimises the placement strategy using the NSDE algorithm.
• CPG: Compared with the data conflict degree, the CPG method is more concerned with the average data access time and the average resource utilisation. Based on the historical data that have already been placed, tasks are preferentially placed on the compute node closest to their required historical data, so that each task has the shortest average access time for historical data. Then, based on the placement of each task, the input data are preferentially placed on the storage node closest to the task set to which they belong; among storage nodes at the same distance, input data are preferentially placed on the node with the highest resource utilisation.
• CPGA: The CPGA method is functionally close to our CPDE method and was mainly proposed by Xu et al. [22]. As an optimisation algorithm of the same type as NSDE, it is used as a performance baseline for our CPDE method.

Comparison and analysis of method performance
In this section, we compare and analyse the performance of the three methods on the three objective functions to demonstrate the superiority of our proposed CPDE method in terms of overall performance.
Fig. 6 shows the performance of the three placement methods on the average data access time indicator for the four data set sizes. Overall, the differences between the three methods on this indicator are not very pronounced. First, although CPDE and CPGA are optimisation methods of the same type with relatively close performance, the CPDE method is always slightly better than the CPGA method on the average data access time indicator. Second, based on the greedy algorithm, the CPG method prioritises the historical data access time and then places the input data according to the tasks already placed, so it lacks a joint consideration of task and input data placement. This method maintains good performance on the data access time indicator for small data sets, such as when the number of workflows is 5. However, as the workflow scale grows, the CPG method gradually loses its advantage on this indicator, and the CPDE method gradually becomes the best of the three placement methods.
Fig. 7 shows the performance of the three placement methods on the average resource utilisation indicator for the four data set sizes. First, the CPDE method is always slightly better than the CPGA method on this indicator. Second, because the CPG method prioritises the historical data access time, the decentralised placement of historical data leads to decentralised task placement, which in turn decentralises the input data placement. Therefore, the resource utilisation of each node is low for small data sets, for example when the number of workflows is 5.
However, as the size of the data set gradually increases, the resource utilisation of each node also increases significantly and the advantage of the greedy algorithm is gradually revealed; in particular, when the number of workflows is 15, the CPG method performs best. In the other cases, when the number of workflows is 5, 10 and 20, the CPDE method always maintains the best performance among the three placement methods. Fig. 8 shows the performance comparison of the three methods on the data conflict degree indicator based on the four data sets. It can be clearly seen that the three methods differ considerably on this indicator. First, the CPG method obviously has the worst performance: it prioritises the average data access time and the average resource utilisation, two indicators that tend to place all tasks and data centrally to ensure shorter data access time and higher resource utilisation, while the data conflict degree is not considered at all. Reducing the data conflict degree requires dispersing the conflicting data, which contradicts the placement principle of the CPG method. Second, CPDE and CPGA are optimisation methods of the same type. Although their performance on this indicator is still close, the CPGA method is slightly better than the CPDE method when the number of workflows is 5, 15 and 20; the CPDE method performs best only when the number of workflows is 10.
In summary, although the performance of the three methods differs on the average data access time and average resource utilisation indicators, the gaps are not large. On the data conflict degree indicator, however, the CPG method is far inferior to the other two methods. In addition, the CPGA and CPDE methods are based on optimisation algorithms of the same type; although their performance on the three indicators is always similar, the CPDE method is more often superior to the CPGA method. Therefore, for data sets of different sizes, we tally which method is optimal on each of the three performance indicators, as shown in Table 4. It can be clearly seen that the CPDE method exhibits the best performance most often, followed by the CPGA method and then the CPG method.
Finally, it can be concluded that our proposed CPDE method is, on the whole, better than the other two methods, as verified by these experiments.

Related work
In order to improve the execution efficiency of applications in a cluster, optimising the data placement strategy helps to reduce the data access time of those applications.
Li et al. [4] proposed a two-stage data placement strategy and adopted a discrete particle swarm optimisation algorithm to optimise data placement and reduce data transfer cost. In [7], an adaptive data placement strategy considering dynamic resource change for data-intensive applications was proposed; based on resource availability, this method can reduce data movement cost-effectively. Ebrahimi et al. [8] proposed BDAP, a population-based distributed data placement optimisation strategy. These data placement strategies are effective. However, with the rapid increase of applications and data in clusters, the resource utilisation of equipment is also receiving more and more attention. In [11], based on limited resources, Whaiduzzaman et al. proposed the PEFC method to improve the performance of cloudlets. In [13], Chen et al. proposed a correlation-aware virtual machine placement scheme to enhance resource utilisation. In addition, ensuring the stability and security of data in the cluster is receiving increasing attention. In [17], Kang et al. formulated the data placement problem as a linear programming model and developed a heuristic algorithm named SEDuLOUS to solve the security-aware data placement problem. At the same time, some scholars have studied these indicators jointly. In [19], the authors proposed BPRS, a big data replica placement strategy that can reduce the data movement of each data centre and improve load balancing.
However, to the best of our knowledge, few placement strategies consider all three important factors of average data access time, average resource utilisation and data conflict degree. Therefore, based on these three objectives, this paper optimises the placement of tasks and data using the NSDE algorithm and achieves notable results.
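At the core of NSDE (and of the NSGA-II underlying CPGA) is a Pareto-dominance test over the three objectives. The following is a minimal sketch of that test, assuming all objectives are expressed as minimisation (average resource utilisation would be negated before comparison); the objective tuples are illustrative, not measured results.

```python
# Minimal sketch of the Pareto-dominance test behind non-dominated sorting
# in NSDE/NSGA-II-style multi-objective optimisers (all objectives minimised).
def dominates(a, b):
    """True if solution a is at least as good as b in every objective and
    strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def non_dominated(solutions):
    """Return the Pareto front of a list of objective tuples, i.e. the
    solutions not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]
```

Non-dominated sorting repeatedly peels off such fronts to rank a population, which is why no single indicator is privileged the way the average data access time is in CPG.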

Conclusion and future work
In order to reduce the average data access time of meteorological applications, the meteorological department mainly optimises the placement of the massive meteorological historical data, but lacks optimisation of the task and data placement of each application. Therefore, we first modelled the placement of tasks and data in meteorological applications as a constrained multi-objective optimisation problem. Second, we analysed and constructed the data access time model, the resource utilisation model and the data conflict degree model, and solved the constrained multi-objective problem using the NSDE algorithm. Finally, the effectiveness of our proposed CPDE method was verified through comparison and analysis of multiple sets of experiments.
Building on the work in this paper, we will continue to optimise the placement of tasks and data to further reduce the average data access time and the data conflict degree while improving the average resource utilisation. In addition, we will continue to refine our method based on its performance in practice.