A comprehensive modeling approach for the task mapping problem in heterogeneous systems with dataflow processing units

We introduce a new model for the task mapping problem to aid the systematic design of algorithms for heterogeneous systems including, but not limited to, CPUs, GPUs, and FPGAs. A special focus is set on the communication between the devices and its influence on parallel execution, as well as on device-specific differences regarding parallelizability and streamability. We give a comprehensive description of how a given task mapping can be abstractly evaluated, including mappings to dataflow-based hardware accelerators. We show how this model can be utilized in different system design phases and present two novel mixed-integer linear programs to demonstrate the usage of the model, showing significant improvements compared to a pure CPU mapping for randomly generated task graphs. To the best of our knowledge, we present the first ILP for task mapping that considers pipelining effects when streaming tasks on an FPGA.

In our work, we aim to lay a foundation for the development of generalized task mapping algorithms in heterogeneous systems. For this, we develop an abstract model with a special focus on data-intensive applications, where communication costs play a significant role. With this model, we aim to support developers in the early design phases of a heterogeneous system and clear the path for theoretical evaluations and comparisons of task mapping algorithms. In particular, we follow a new algorithm design approach in which the same mapping algorithm can be used throughout all design stages. We demonstrate the capabilities of the model based on two mixed-integer linear programs in a sample environment, which can be used as a reference for future heuristics. This work is a revised and extended version of a conference paper published at HeteroPar'2022,1 with a particular focus on the modeling of streaming behavior as a central property of dataflow processing units implemented on FPGAs.
The main contributions of this paper are
• A new model-based paradigm for the development of mapping algorithms and the design of heterogeneous systems.
• A comprehensive platform-independent model that serves as the basis for the new paradigm.
• A proof-of-concept realization of the abstract model, that is, a concept for determining system-specific mappings using generalized algorithms.
• Three mixed-integer linear programs for the selection of suitable mappings that may be used as a baseline for the development of future mapping algorithms.
The article is structured as follows. In Section 2, we give a short overview on the state of the art for the task mapping problem with a focus on existing modeling approaches. In Section 3, we elaborate on the underlying design methodology of this paper. In Section 4, we develop a model for the task mapping problem in three different abstraction layers. Section 4.1 gives a high-level abstraction of a heterogeneous system, which is then transferred into a mathematical model in Section 4.2, where we give a definition of the generalized task mapping problem including an easily computable, yet sufficiently complex, cost function. In Section 4.3, we suggest a simple realization of the abstract model for an early system design stage, stating how the necessary values and functions can be derived from general knowledge of the used hardware. In Section 4.4, we provide a short discussion of the modeling of databus usage, whereas in Section 4.5 we modify the mathematical model to support streaming behavior. Section 5 is concerned with mixed-integer linear programming models, where Sections 5.1 and 5.2 introduce a device-based and a time-based approach and Section 5.3 extends the time-based approach to incorporate streaming. In Section 6.1, we evaluate the mappings resulting from the MILP models based on the previously defined cost function. Finally, in Section 6.2, we compare the predictions for a CPU-GPU environment with actual time values derived through an OpenCL evaluation, before we conclude our work in Section 7.

STATE OF THE ART
The mapping of tasks to processing devices (also called resource/task allocation or workload partitioning) describes a central step in the design of heterogeneous systems. A well-chosen task mapping can greatly improve the performance of a system and enable parallel execution of tasks on different devices. As such, it is closely related to task scheduling, which focuses on the order of execution. In highly heterogeneous systems it is mainly a matter of perspective whether task mapping is a substep of scheduling or vice versa. A general distinction can be made between static and dynamic (or, respectively, offline and online) task mapping algorithms.2 Static algorithms determine one fixed mapping before the start of the computation, whereas dynamic algorithms decide at runtime which computation device should be used for an incoming task. Static task mapping generally requires information on expected execution times for the potential task-device combinations, but needs only one actual implementation for each task in the productive system. In contrast, dynamic algorithms can deal better with uncertainty, but need one implementation for each task-device combination available in the productive system. While languages like OpenCL3 or OpenMP4 attempt to unify the programming interface for different devices, device-specific optimizations must still be carried out manually,5,6 especially if FPGAs are involved.7 The focus of our work lies in the design-time design space exploration for the development of heterogeneous system architectures, for example, on a heterogeneous multiprocessor system-on-chip (MPSoC) such as the Xilinx Versal ACAP platform.8 Hence, we focus on the design of static algorithms. However, the model we introduce is not limited to static single-chip architectures, but is designed to be used in a broad application area. In the context of our work, the scheduling of tasks cannot change the task-device assignment. Consequently, it is separated from the actual mapping and is not further considered.
Much work exists for CPU-GPU task mapping.2,9 However, research in this field mainly focuses on (application-)specific algorithms or architectures without a reference to a general model or a common measure of cost. This makes it difficult to compare different approaches and to transfer insights to new problems. Some authors introduce more detailed models,10,11 however, the underlying parallelism of a heterogeneous system is seldom taken into account, especially with respect to the impact of data transfer. Baruah formulates the heterogeneous multiprocessor partitioning problem as a decision problem.12 The goal is to determine whether a mapping of tasks to devices exists such that the maximum utilization of all devices is not exceeded. The task-device execution times are given through a utilization matrix. As such, the definition is closely related to the agent bottleneck generalized assignment problem in the field of production research.13,14 In both problem formulations, the parallel execution through different agents is central, but communication cost and interdependencies between the tasks (agents) are not present. A recent publication by Alasmar et al. bases its work on a similar model, labeling the problem as workload distribution.15 Kwok et al. and Topcuoglu et al. define a heterogeneous task scheduling problem using a simulation-based cost function.16,17 In their model, they take both parallelism and communication cost into account.
However, they do not model contention on memories or processing devices. Emeretlis et al. propose a simulation-based ILP model, which captures task dependencies together with both parallel execution and device contention,18 but without data transfer times. In a recent publication, they extended the model to take communication delays into account.19 While their ILP already considers many important aspects of task mapping in heterogeneous systems, its high computational complexity makes it impractical to function as a cost function for a generalized system model.
Little work is present that includes devices that are able to implement dataflow processing units, such as FPGAs. In dataflow-based processing units, the execution order depends exclusively on the data dependencies between instructions, as opposed to processing units whose execution depends on a predefined control flow. Due to their inherent data-driven design philosophy, they are able to process data streams very fast by executing the relevant subtasks in a pipeline. In combination with the area limitations present in FPGAs, this leads to vastly different behavior. Modeling these differences is crucial for exploiting their full potential.20 Yet, works that include FPGAs frequently model them similar to software processing units.21 Owaida et al. discuss these differences in the context of designing OpenCL tasks for FPGAs.22 Theodoridis et al. designed a rather elaborate ILP for task mapping with a special focus on hardware platforms.23 They consider communication costs, data dependencies, and resource conflicts. For hardware units, they introduce area constraints to the mapping problem and factor in reconfiguration delays, but do not consider pipelining behavior for streamable tasks. Sotiropoulou et al. extend this ILP to reflect memory usage.24 Much work is done in the closely related field of hardware/software partitioning.25 Models in this field better reflect hardware properties,26 but usually do not differentiate between software units, for example, in terms of parallelizability. In order to reduce the development time of hardware-accelerated software systems, there are various efforts to anticipate the expected gain through hardware accelerators using only a high-level software description. Examples for this approach are MPSeeker27 and AccelSeeker.28 Both tools help in deciding early which subfunctions can be improved most through hardware acceleration; however, they do not consider more complex heterogeneous systems consisting of multiple different computation devices.

DESIGN METHODOLOGY
The overall goal of this work is to motivate a new methodology in the design and application of mapping algorithms for heterogeneous systems. We see two core issues in the current state of the art. First, with regard to algorithm development, there is a lack of comparability between different task mapping approaches. Evaluations are usually done through runtime in seconds, which is hardly transferable to a different platform. Furthermore, the absence of a common model may make a comparison infeasible even if an algorithm is re-implemented. In consequence, there are few possibilities to objectively assess the quality of an algorithm and to iteratively and sustainably improve mapping heuristics.
Second, a closely related problem is the lack of transferability of task mapping approaches between applications and platforms. Consequently, there are no widely accepted problem-specific heuristics. Instead, metaheuristics and educated guesses are omnipresent. While the former require careful tuning of parameters, have high runtimes, and cannot exploit problem-specific characteristics, the latter are not scalable and are error-prone, especially with respect to the increasing complexity of heterogeneous systems.
While the missing comparability mostly affects the development of mapping algorithms, the lack of transferability mostly affects system design. We aim to tackle both of these problems with a new modeling concept. Our goal is to decouple the algorithm development and evaluation process from a specific platform. Figure 1 shows on the right the desired development approach for mapping algorithms based on an abstract platform-independent system model. The abstract model should allow for the evaluation of mapping algorithms without the need for an implementation or even ownership of a certain hardware platform. Mapping algorithms that are compatible with the model can then be easily and objectively compared using a benchmark consisting of the relevant parameters of both different platforms and applications. There are two main advantages of this approach. First, the comparison between two mapping algorithms is guaranteed to be both fair and generalizable to different platforms and, hence, more meaningful. Second, it allows for an enormous reduction in the development time of new mapping algorithms, since an actual implementation is no longer strictly required for a basic evaluation. The set-up and development of a heterogeneous system is a very time-consuming process that can take several weeks. In the case of FPGA development, even building just a single configuration may take hours. In the classical approach, every comparison needs to undergo this process. With the model-based approach, the evaluation of one data point is a matter of seconds. Naturally, the model must be sufficiently expressive to closely approximate a real system. After an extensive model-based evaluation, promising configurations can be implemented on a specific target platform to exemplarily verify the prediction of the model. However, with a well-evaluated, trustworthy model this step can become optional in the future.
Aside from the presented benefits for algorithm research, using a generalized model also significantly benefits the workflow of an application developer. Classically, the developer guesses a potentially beneficial mapping based on the platform and application structure, implements the mapping, and uses profiling to decide whether an improvement could be observed (see Figure 2). The availability of a generalized model enables the creation of easy-to-use mapping libraries, whose algorithms can be applied to realizations of the abstract model. These realizations are platform- and application-specific models that act as specification and parameterization of the generalized model. Since the development of mapping algorithms depends only on the generalized model, but not on any realization, a specific realization can be easily exchanged without changing the algorithm.
FIGURE 1 Differences between the classical and our desired approach to the comparison of mapping algorithms. In the classical approach on the left, mapping algorithms are developed based on a specific platform and a platform-specific model. A comparison between two algorithms is done based on a runtime evaluation of selected mapping implementations. In our approach on the right, mapping algorithms are developed based on an abstract model, which allows for a model-based comparison based on predefined benchmarks without any actual system or implementation.
FIGURE 2 Differences between the classical and our desired approach to task mapping during the design of heterogeneous systems. Classically, the designer develops a mapping strategy based on the target platform and target application. The found mapping is then implemented and evaluated. Based on the evaluation results the designer may then improve their mapping strategy. In our approach, a mapping library can be provided based on a generalized model, such that the mapping algorithm does neither depend on the platform or application, nor on the evaluation. Instead, a platform- and application-specific model realization is used and can be refined after evaluation.
In an early design stage, a realization can be created just based on the specifications without any actual implementation. In this stage, it can support the developer both in the decision which components should be part of the system and which parts of an application should be implemented on the chosen components. In particular, the decision about the acquisition of specialized hardware can be made depending on the expected gain from a heterogeneous implementation. In later design stages, measured data from implementations of selected system components can be used to refine the realization and get more accurate predictions of the actual hardware behavior. Note that the quality assessment of the mapping algorithm as shown in Figure 1 only depends on the generalized model, not on the chosen realization. Hence, while a more accurate realization leads to better results for the specific application scenario, the algorithm used in the early design stage can be used for later design stages as well.

MODELING
In this section, we develop an abstract system model with a minimal set of interfaces that allows us to define a cost function to assess the quality of a given task mapping. We then show how this model can be utilized in different phases of a systematic design space exploration for a heterogeneous system. Finally, we extend this model with a special focus on streaming behavior in dataflow devices.

Concept
In different design phases, different knowledge about the system properties is present; therefore, it is crucial to make single components of the model exchangeable without the need to adjust other components or the underlying algorithm. For this, we split the system model into an application model, which describes the properties of and relations between tasks, a platform model describing the characteristics of the available hardware, and an implementation model, defining the relation between the available hardware and the application model.

FIGURE 3 An overview of the high-level system model. The application and platform model are conceptually independent. The implementation model acts as an interface between these two models, defining three main functions. In the application model, the small nodes in each task represent input and output memory nodes, while the large nodes represent computation nodes. Analogously, in the platform model, small circles represent memories and large circles represent processing devices. The compatibility function defines which task node can be mapped to which device by assigning a Boolean value to each node-device pair. For the pair pointed at in the example it would evaluate to false and thereby signal that the memory node cannot be mapped onto the computation device.
The application model is based on a task graph, that is, a directed acyclic graph where nodes represent tasks and edges represent data dependencies between these tasks. Similarly to Campeanu et al.,10 we differentiate between computation nodes and memory nodes. While computation nodes indicate that a certain computation must be executed, memory nodes indicate that data must be made available. More precisely, each task consists of three nodes: a memory node representing the input data, a computation node, and a memory node representing the output data (see Figure 3). This representation is based on the assumption that a high amount of data needs to be processed, making memory access mandatory during the execution of each task. It allows us to accurately differentiate between the cost caused by the computation and the cost caused by the memory access. In particular, it allows us to consider different locations for the data. For example, a CPU could work on data provided by the System RAM and write it back directly into the GPU RAM.
In the hardware model, we assume that (1) each computation device is connected to (at least) one associated memory, (2) inter-task data transfer can only happen between memories (not between computation devices), and (3) the computation of a device is blocked by a memory transfer from or to the memories it is currently working on. Usually, the associated memory refers to a respective RAM unit, for example, a GPU RAM for the GPU or the System RAM for the CPU. The model, however, is not limited to one memory unit per device. While the data transfer between different memories is usually done through DMA units, it is still reasonable to assume that computation units are affected by the memory transfer, since they cannot access their respective data. However, if this leads to underutilization of a processor or memory, excess data rate can be used to start independent tasks. We elaborate on this in Section 4.4.
The (task) implementation model represents the relation between the application and hardware model. Its main purpose is to work as an interface between those two models and to make parts of the modeling framework more interchangeable. Between each node of the application model and each device, a compatibility relation is defined that indicates which task can be mapped onto which device. Naturally, memory nodes can only be mapped onto memories and computation nodes must be mapped onto a processing device. However, there can be further restrictions.
For example, a cache may only fit memory nodes that contain a small amount of data, or a tensor processing unit can only execute a small subset of tasks. In addition to a compatibility function, the implementation model defines how much time is needed to execute a task on a certain device or to transport the output of a task from one device to another. In Figure 3, the special role of the implementation model as interface between the application and platform model is shown. The execution time as well as the compatibility are relations between task nodes and processing devices. The data transfer time represents a relation between an edge in the task graph and a connection between two devices in the hardware model (which may be modeled as a graph as well). In the following section, the data transfer time is realized as a relation between one task node (the sender) and two devices.
The overall advantage of the described modeling approach lies in the possibility to easily evaluate a given task mapping while abstracting from implementation and platform details. In the following section, we define a cost function based on a simple, but reasonably effective, evaluation algorithm, which manages time values for all devices (including memories) and executes the given tasks based on a predefined topological sorting.

Formalization
Based on these preliminary considerations, we give a formal description of the problem. The notation we establish in this section is used throughout the rest of the paper. We first specify how a basic task graph is extended to contain additional memory nodes.
Definition 1. Let V_c be a set of computation nodes and V_d be a set of data nodes. We call a triple ṽ = (v_in, v_c, v_out) ∈ V_d × V_c × V_d a memory-augmented task. For a set of memory-augmented tasks Ṽ we call a directed acyclic graph G = (Ṽ, E) a memory-augmented task graph. We denote the set of all memory and computation nodes in Ṽ by V.
Based on this task graph, we define a task mapping as a mapping of both computation and data nodes to (processing and memory) devices.
Definition 2. Let G = (Ṽ, E) be a memory-augmented task graph and let D be a set of devices. We call a function M ∶ V → D a task mapping for G. We call a function c ∶ V × D → {0, 1} a compatibility function. If c(v, M(v)) = 1 for all v ∈ V, we call M valid with respect to c.
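To make these definitions concrete, the following minimal Python sketch shows one possible encoding of memory-augmented tasks, task graphs, and mapping validity; all names are illustrative and not part of the model itself.

from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryAugmentedTask:
    """A triple (v_in, v_c, v_out): input memory, computation, and output memory node."""
    v_in: str
    v_c: str
    v_out: str

@dataclass
class TaskGraph:
    """A memory-augmented task graph G = (V~, E)."""
    tasks: list   # list of MemoryAugmentedTask
    edges: list   # list of (MemoryAugmentedTask, MemoryAugmentedTask) data dependencies

    def nodes(self):
        """The set V of all memory and computation nodes."""
        return [n for t in self.tasks for n in (t.v_in, t.v_c, t.v_out)]

def is_valid(mapping, compatible):
    """Check validity of a task mapping M: V -> D w.r.t. a compatibility function c."""
    return all(compatible(v, d) == 1 for v, d in mapping.items())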
The most important part of the model is the cost function, which provides an estimate of the performance of a system under a given mapping. It is crucial that the cost of a given mapping can be determined rapidly while still being sufficiently close to a real system for its results to be transferable. The actual runtime of a real system heavily depends on the used schedule. We reflect this by letting the cost of a mapping depend on a given topological sorting. Based on this sorting, we define the cost function recursively using device time values, which simulate the behavior of an execution by managing the last time a device was in use.

Definition 3. Let G = (Ṽ, E) be a memory-augmented task graph, let D be a set of devices and let M be a task mapping for G. Let t ∶ V × D → R≥0 be the time required to execute a task on a given device and d ∶ V × D × D → R≥0 be the time required to transfer the output data of a node from one device to another. For a given mapping M we shortly write t_v = t(v, M(v)) and d_uv = d(u, M(u), M(v)) for u, v ∈ V. Finally, let S be a topological sorting of Ṽ ∪ E, that is, a topological sorting of G including the edges, and let S(i) denote the ith element in this sorting. Then we define the device time T_i of a device p ∈ D at step i ∈ {0, … , |S|} recursively as T_0(p) = 0 and, for i ≥ 1,

T_i(p) = max_{q∈P_i} T_{i−1}(q) + t_{v_in} + t_{v_c} + t_{v_out}, if S(i) = ṽ = (v_in, v_c, v_out) ∈ Ṽ and p ∈ P_i = {M(v_in), M(v_c), M(v_out)},
T_i(p) = max_{q∈P_i} T_{i−1}(q) + d_{u_out,v_in}, if S(i) = (ũ, ṽ) ∈ E and p ∈ P_i = {M(u_out), M(v_in)},
T_i(p) = T_{i−1}(p), otherwise.

We call cost_S(M) = max_{p∈D}(T_{|S|}(p)) the cost of the mapping.
The cost function is defined in such a way that for each element in the topological sorting, the device times of the associated devices are increased by the cost of the task execution or data transfer, respectively. When all elements have been processed, the maximum over all device times equates to the total delay of the task graph execution. With this cost function, we are able to create a comprehensive definition of the task mapping problem.
Definition 4. Given a memory-augmented task graph G, a set of devices D, functions t, d, c and a topological sorting S, the task mapping problem is the problem of finding a valid task mapping for G with minimal cost.
Note that the decision whether a given mapping is optimal also depends on the given schedule. In practice, we expect that a mapping that has low cost for one sorting also has reasonably low cost for most other topological sortings. The evaluation of the impact of the sorting on the cost function, however, is outside the scope of this work. Unsurprisingly, independent of the topological sorting, the task mapping problem is NP-hard.

Lemma 1.
The task mapping problem is NP-hard.
Proof. We show this by a reduction from the partition problem. Let N be a set of positive integers. For each n_i ∈ N create a memory-augmented task ṽ_i. Let G be the memory-augmented task graph containing all ṽ_i and no edges and let D be a set of two computation devices and two memories. Now let the computation time of each task ṽ_i be n_i on both computation devices, let all memory access and data transfer times be zero, and set B = (1/2) ∑_{n_i∈N} n_i. Then for any topological sorting, N can be partitioned into two subsets with equal sum if and only if the cost of the minimal valid task mapping is B. With only two computation devices the cost of a mapping cannot get lower than half the total computation cost. If the cost is exactly B then the final device time of both devices must be B and hence the task mapping gives a partition of N. Reversely, if a partition exists, the tasks can be distributed to the devices according to the partition. Since the tasks do not have interdependencies, the cost of the task mapping will be equal to the sum of the computation times per device, which is B. ▪

Algorithm 1. Computation of the total cost of a given task-device mapping. For each element of the topological sorting it is checked whether it is a node or an edge. For each task (line 3 to 5), the computation cost of the task is added to the respective device times, whereas for each edge (line 6 to 8) the device times are increased by the data transfer cost between the tasks.

1: function DETERMINECOST(TopologicalSorting, Mapping)
2:   for all x ∈ TopologicalSorting do
3:     if x = ṽ = (v_in, v_c, v_out) ∈ Ṽ then
4:       p_i, p_j, p_k ← M(v_in), M(v_c), M(v_out)
5:       time(p_i), time(p_j), time(p_k) ← max(time(p_i), time(p_j), time(p_k)) + t_{v_in} + t_{v_c} + t_{v_out}
6:     else if x = (ũ, ṽ) ∈ E then
7:       p_i, p_j ← M(u_out), M(v_in)
8:       time(p_i), time(p_j) ← max(time(p_i), time(p_j)) + d_{u_out,v_in}
9:     end if
10:  end for
11:  return max_{p∈D} time(p)
12: end function

Note that the cost of a given task mapping can be computed in polynomial time. Hence, the associated decision problem, that is, to decide whether a task mapping exists with total cost smaller than a given constant C ∈ R, lies in NP and is therefore NP-complete.
Algorithm 1 shows the computation of the cost of a given mapping following Definition 3. For each device p, a decoupled time value time(p) is managed, which is increased when the device is in use. Tasks are queued for execution according to the given topological sorting. There are two main cost factors: the transportation of data between nodes u and v, denoted by d_uv, and the execution of a task ṽ, denoted by t_v (see Definition 3).
Transportation of data happens along the edge between two memory nodes. The time values of both memories are synchronized and increased according to the time given by the implementation model. The time for the execution of a task consists of the time for the read access to the input memory, the write access to the output memory, and the computation time on the given device. The time values of all three involved devices are synchronized and the total time for the execution is added to each of them. Note that the input memory waits for the output memory and vice versa to account for the fact that data is processed in small chunks.
After all tasks have finished, the overall cost for the computation is given as the maximum time value over all devices. This value may depend on the used schedule, that is, the order of tasks in the topological sorting. A topological sorting based on a breadth-first search often reflects the inherent parallelism of the task graph sufficiently well. A potential bias can be circumvented by choosing the topological sorting at random. The complexity of the algorithm is linear in the number of nodes and edges of the given graph.
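As a companion to Algorithm 1, the following Python sketch implements the cost computation; the dictionary-based interface for t and d is an illustrative choice of ours, not part of the model.

from collections import defaultdict

def determine_cost(topological_sorting, mapping, t, d):
    """Cost of a task-device mapping (Algorithm 1).

    topological_sorting: tasks (v_in, v_c, v_out) and edges (u_task, v_task),
                         topologically sorted including the edges.
    mapping: dict node -> device; t: dict (node, device) -> execution time;
    d: dict (node, src_device, dst_device) -> transfer time.
    """
    time = defaultdict(float)                       # decoupled time value per device
    for x in topological_sorting:
        if len(x) == 3:                             # a task (v_in, v_c, v_out)
            devices = [mapping[v] for v in x]
            total = sum(t[v, mapping[v]] for v in x)
            start = max(time[p] for p in devices)   # synchronize involved devices
            for p in devices:
                time[p] = start + total
        else:                                       # an edge: move output to input memory
            u_out, v_in = x[0][2], x[1][0]
            p, q = mapping[u_out], mapping[v_in]
            start = max(time[p], time[q])
            time[p] = time[q] = start + d[u_out, p, q]
    return max(time.values())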

Models for different design stages
The high abstraction level of the model presented in Section 4.1 allows the designer to reuse optimization algorithms written for this model in different design stages. In an early design stage, the time for task execution and data transport can be determined based on superficial characteristics of the given tasks and potential devices. This allows for a rapid estimate of the required characteristics for a performance gain and, in consequence, supports the designer in their hardware choice. In a later design stage, promising tasks may be implemented and measured on different devices.
With these more precise values, the same algorithms can support the designer in finding the optimal configuration.
We present a simple realization of the abstract system model that can be used during an early design stage. In particular, we describe a more detailed hardware and application model that fulfills the specifications demanded by the abstraction. The model is primarily based on the task sizes of the given application and the processable data rates of the devices. The general idea is to get an estimate of the processing time of a certain amount of data based on device characteristics. Each task node is attributed with a data processing function, which computes the amount of output data generated from input data of a certain size; for example, a simple sum of two values would have a 2:1 relation between input and output data.
In addition, each node has a complexity function, which determines the amount of computation needed based on the input data. Finally, each computation node indicates which percentage of its execution time is parallelizable. For the sake of simplicity, we assume that the parallelizable part is fully parallelizable with an arbitrary number of processors.
In the hardware model, we compute the data rate of a memory as the product of (1) the bus clock speed, (2) the bus width, and (3) the number of memory channels. We set the serial data rate r_s of a processing device to the clock rate multiplied with a device-specific overhead penalty, describing the overhead caused by the microarchitecture. Note that a penalty is relevant only if the overhead is expected to be vastly different between devices.

FIGURE 4 Strongly simplified realization of the abstract system model in an early design stage. In the application model, data processing functions f_i are used to determine the amount of data that needs to be processed for each task. In the platform model, data rates are given for each device and each connection between devices. In the implementation model, the computation and data transfer times are computed as a simple quotient of the amount of data and the given data rate. The complexity function and parallelization factor are omitted for the sake of simplicity.
For the evaluation in Section 6.2, we apply penalties as part of a refinement based on basic measurements of the target system. With a more elaborate model, however, estimations of these penalties can already be derived before any actual implementation. In addition to the serial data rate, each processing device is assigned a parallelization factor r_p consisting of (1) the number of cores and (2) the potential data parallelism. For example, in case of a GPU, the second factor equals the number and width of SIMD units.
Finally, in the implementation model, we set the execution time t of a task node on a device to 0 for a memory node and to data_in/(r_s · (1 − p + p·r_p)) for a computation node, where p ∈ [0, 1] denotes the parallelizability of the task. The data transfer time d is determined as the quotient of the output data size and the data transfer rate between the connected devices, which in turn is set to the minimum of the data rates of the respective devices and a potential data rate limitation between them. It is set to infinity if no edge is present in a given hardware graph. Figure 4 depicts the basic principle behind this realization. The execution time and the data transfer time are shown in their simplest form, without including the parallelization factor or the task complexity function. However, they can be made arbitrarily complex in order to better reflect the actual behavior. Note that, in order to lead to an optimal mapping, these functions do not need to calculate the actual execution times of the real system. Rather, the relation between the transfer and execution times of different tasks and devices must reflect the relation between these times in the real system. For the experiments performed in Section 6, we further extend this realization by introducing a factor for the streamability of a task.
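The following Python sketch shows how such an early-stage realization could derive the functions t and d from spec-sheet values; the field names are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class Device:
    data_rate: float      # memory: bus clock * bus width * channels; processor: serial rate r_s
    r_p: float = 1.0      # parallelization factor (cores * data parallelism), processors only
    is_memory: bool = False

def execution_time(data_in, p, dev):
    """t for a node: 0 for memory nodes, data_in / (r_s * (1 - p + p * r_p)) otherwise."""
    if dev.is_memory:
        return 0.0
    return data_in / (dev.data_rate * (1.0 - p + p * dev.r_p))

def transfer_time(data_out, src, dst, link_limit=float("inf")):
    """d: output data size divided by the minimum of both data rates and a link limit."""
    rate = min(src.data_rate, dst.data_rate, link_limit)
    return data_out / rate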
Using this model, an early assessment of the potential of a heterogeneous implementation can be made. In a later design stage, a measurement-based model in the form of a utilization matrix should replace these rough estimates. For this, (time) complexity functions for both the execution and the transport time should be derived from the measured data, which can then be directly incorporated into the task implementation model. Using appropriate penalties, a mixture of both models can be used if measured data is not available for all task-device combinations.

Extension: Full usage of data busses
Data transport between two memories is usually done through DMAs, which are independent of the processing devices. Hence, processing devices are in principle able to execute tasks during the transport of (independent) data. In the presented abstract model, on the other hand, we wait until the input and output memories are unoccupied before we start another execution. The reasoning behind this decision is that during processing, data must be accessed by the processing device and therefore access to the memory bus is needed. However, a data transaction does not always use the full data rate of both memories. If, for example, memory is transferred between System RAM and GPU RAM, the transaction speed is usually limited by the bandwidth of the GPU RAM. The remaining bus width of the System RAM can be used by a processing device to access data.
The resulting gain in performance can be incorporated into the model by adjusting the blocking time according to the used resources. Let r_1, r_2 be the data rates of two devices p_1, p_2 with r_1 ≤ r_2. Then a data transport between these two devices that takes time t increases (after synchronization) the time value of p_1 by t and of p_2 by t · r_1/r_2. The increase in the time value of p_2 represents the time that the device would work if it could use all of its resources for the task, that is, the total delay that a parallel execution of other tasks accessing p_2 would experience. Note that the additional capabilities can only be used by independent computations. A task that is dependent on the data transport between p_1 and p_2 will not be able to make use of the free resources. Hence, the cost computation algorithm must assure that a dependent task waits the full time t until its computation is started.
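A minimal sketch of this adjusted update, assuming the transfer runs at the slower rate r_1 and that dependent-task bookkeeping is handled elsewhere:

def transfer_with_partial_blocking(time, p1, p2, r1, r2, data):
    """Update device time values for a transfer of `data` between p1 and p2 (r1 <= r2).

    The slower device is blocked for the full transfer time t, while the faster
    device is only charged t * r1 / r2, freeing its excess bandwidth for
    independent tasks. Dependent tasks must still wait until the returned time.
    """
    t = data / r1                        # the transfer runs at the slower rate
    start = max(time[p1], time[p2])      # synchronize both devices
    time[p1] = start + t
    time[p2] = start + t * r1 / r2
    return start + t                     # earliest start for dependent tasks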

Extension: Virtual memory and streamability
In the current model, we write data back to the memory after each task execution. Depending on the granularity of the tasks, this may be inefficient if a subsequent task is executed on the same device. If a task works only locally on the given data, we may perform several subsequent processing steps on the same data before writing it back to memory. Such tasks are called streamable. We can model this behavior in two ways: (1) we modify the cost function to ignore memory accesses between subsequent tasks that are executed on the same device and do not produce intermediate data used by other devices or nonstreamable tasks, or (2) we introduce virtual memories into the hardware model with zero access time from the chosen device and infinite data transfer time to other devices. Virtual memories can then be used in between operations on the same device to hide the memory access. The first variant increases the complexity of the cost function, whereas the second variant shifts the responsibility to the mapping algorithm. A special case for streamability is the handling of devices that can implement dataflow processing units, such as FPGAs. Here, not only can the memory access be omitted, but the execution of tasks can also be pipelined, that is, operations can be executed in parallel along the stream.
Therefore, a subtree of streamable tasks on such a device will only take as long as the most expensive processing or memory node in the subtree. A limitation to this property is given by the limited area on such a device. To integrate this behavior into our model, we introduce an area requirement for all tasks and modify the cost function to compress subtrees up to the size of the respective device to one single task. For the latter, we need a notion of compressability.
Definition 5. Let G = (Ṽ, E) be a memory-augmented task graph, let M be a task mapping, let S be a topological sorting of Ṽ ∪ E and let p be a streaming-capable device. We call a subgraph G′ = (Ṽ′, E′) of G p-compressable with respect to M and S if (1) every computation node in Ṽ′ is mapped to p, (2) the total area requirement of Ṽ′ does not exceed the capacity of p, (3) every edge of G entering G′ from outside precedes all elements of G′ in S, and (4) every edge of G leaving G′ succeeds all elements of G′ in S. If no vertex or edge can be added to Ṽ′ without violating these conditions, G′ is called maximal.
The reasoning behind this definition of compressability is that tasks can be executed together if all of them are mapped to the same device and share the same dependencies. If the dependencies of parts of the subgraph are already fulfilled, it might be beneficial to already execute these parts (and hence, split the subgraph) instead of waiting for the other parts. Similarly, if a (computationally less intensive) part of a subgraph is already sufficient to fulfill the dependencies of a task on another device, it might have a detrimental effect on the overall performance to stream its output to an expensive task and, hence, to prolong the time until the dependency can be fulfilled. Note that we do not impose restrictions on the memories used in a compressable subgraph. This is due to the fact that, in principle, every memory can be used in a streaming fashion. However, streaming through a memory will cause the memory to be occupied during the whole streaming operation. Hence, using an unfitting memory will be penalized by the cost function.
In Figure 5, maximal compressable subgraphs are shown for a given mapping (indicated by the node color) and three different topological sortings (indicated by the labels). In Figure 5A, a topological sorting based on a breadth-first search is shown. In this example, the task at position 2 cannot be included in the subgraph since it precedes the incoming edge at position 3 in the sorting. Similarly, the task at position 13 is not included since it is executed after the outgoing edge at position 12. As explained above, these exclusions are the more beneficial, the larger the computation times of the tasks at positions 1 and 13, and hence the potential waiting times for the tasks at positions 2 and 14, respectively, are.
A topological sorting that leads to a maximal compressable subgraph including all nodes of the device in question is shown in Figure 5B. The blue nodes and their incident edges are ordered in a way that they fully precede or succeed the subgraph. Conversely, in Figure 5C the upper blue node has position 3 in the sorting and is therefore executed as late as possible, while the bottom blue node is executed as early as possible with position 9, leading to a small compressable subgraph of only two tasks. While the streaming capabilities of the device are badly exploited in this example, this sorting is advantageous if the path with labels 3 − 9 is part of the critical path of the overall application.

FIGURE 5 Maximal compressable subgraphs (marked in red) for three different topological sortings of the same graph. Tasks in green are mapped to the device for which a compressable subgraph should be determined; tasks in blue belong to another device. Labels indicate the position in the topological sorting. Depending on the sorting, two, three, or all five green nodes are part of the maximal compressable subgraph containing the task ṽ, which is at position 5 in each sorting.
FIGURE 6 Example for a task mapping where not all connected compatible nodes can be added into one compressable subgraph.
While it may at first glance seem superfluous to outsource the node selection to a smart selection of the topological sorting, there are situations in which no obvious decision is possible. Figure 6 shows a mapping in which not all of the compatible connected nodes can be put into the same compressable subgraph. With the shown sorting, the tasks at positions 5 and 8 can be used for streaming. With a different sorting, the tasks currently at positions 1 and 5 could be used instead. However, it is never possible to stream between all three tasks, since the node at position 4 must always be evaluated in between their execution. With the constraints set in Definition 5, streaming in the resulting subgraph is always guaranteed to be feasible in a real system.
With the notion of compressable subgraphs, we can modify the cost function to incorporate streaming behavior. For this, we first compress all parts of a given memory-augmented task graph G that are mapped to a streaming-capable device. Let M be a mapping and S be a topological sorting of G. Then for each streaming-capable device p ∈ D we execute the following steps (a code sketch follows after this list): 1. Find the topologically first unmarked node ṽ = (v_in, v_c, v_out) ∈ Ṽ in the sorting where M(v_c) = p.
2. Find a maximal p-compressable subgraph G ṽ of G that contains ṽ and no marked node (Algorithm 3).
3. Mark all nodes and edges of G ṽ as part of this subgraph.
4. Remove all nodes and edges of G ṽ , except for ṽ, from the topological sorting S.
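A minimal sketch of this compression loop, assuming a find_maximal_subgraph helper in the spirit of Algorithm 3 (all names illustrative):

def compress_streaming_subgraphs(sorting, mapping, streaming_devices, find_maximal_subgraph):
    """Per streaming device, replace maximal compressable subgraphs by their topological base."""
    marked = set()
    for p in streaming_devices:
        while True:
            # Step 1: topologically first unmarked task whose computation node maps to p.
            base = next((x for x in sorting if len(x) == 3 and x not in marked
                         and mapping[x[1]] == p), None)
            if base is None:
                break
            # Step 2: maximal p-compressable subgraph containing the base (Algorithm 3).
            subgraph = set(find_maximal_subgraph(sorting, mapping, base, p))
            # Step 3: mark all nodes and edges of the subgraph.
            marked |= subgraph
            # Step 4: remove all subgraph elements except the base from the sorting.
            sorting = [x for x in sorting if x == base or x not in subgraph]
    return sorting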
We call the node ṽ whose topological index is smaller than every other index in a subgraph G_ṽ the topological base of G_ṽ. After executing these steps, the topological sorting contains only one entry for each compressed subgraph, namely its topological base. For the evaluation, only the topological sorting of the memory-augmented task graph is needed (see Algorithm 1), hence we can easily modify the cost function to include the joint computation of a subgraph. In Algorithm 2, the extended computation is shown. In the case of a single node or edge, the algorithm behaves the same as before. If a compressed subgraph should be evaluated, the maximum of all individual transaction or computation times inside the subgraph is taken and added to the time of each device that is involved in the streaming operation.

Algorithm 2. Extended cost computation including compressed subgraphs, where SubgraphTime denotes the maximum of all individual computation and transfer times inside the compressed subgraph (Ṽ′, E′).

1: function DETERMINECOST(TopologicalSorting, Mapping)
2:   for all x ∈ TopologicalSorting do
3:     if x = ṽ = (v_in, v_c, v_out) ∈ Ṽ then
4:       p_i, p_j, p_k ← M(v_in), M(v_c), M(v_out)
5:       time(p_i), time(p_j), time(p_k) ← max(time(p_i), time(p_j), time(p_k)) + t_{v_in} + t_{v_c} + t_{v_out}
6:     else if x = (ũ, ṽ) ∈ E then
7:       p_i, p_j ← M(u_out), M(v_in)
8:       time(p_i), time(p_j) ← max(time(p_i), time(p_j)) + d_{u_out,v_in}
9:     else if x = (Ṽ′, E′) then
10:      time(p_i) ← max_j(time(p_j)) + SubgraphTime for all p_i = M(v_i) with v_i ∈ V′
11:    end if
12:  end for
13:  return max_{p∈D} time(p)
14: end function

While the modification of the cost function is quite straightforward, determining a maximal compressable subgraph is more demanding. We illustrate this statement with a short example. Assume the original graph consists of n tasks which are connected as a linear list. Furthermore, let there be one edge between the source and the sink of the graph. Then, assuming all of these tasks are eligible for streaming, for each task besides the source we need information about all other tasks to decide whether the task can be added to a compressable subgraph containing the source. If any of these tasks cannot be added, for example, due to an additional outgoing edge, none of the tasks can be added. Algorithm 3 reflects this behavior by managing a set of pending tasks and edges, which are revisited at the end to check which of these elements actually belong to the new subgraph.
In the first part of the algorithm, the wavefront, that is, the set of tasks and edges which are at the border of the expanding (potential) subgraph, is driven through the graph until it hits an element that cannot be part of the compressable subgraph due to general incompatibility or unresolved dependencies. As soon as such an element is found, no element with a larger index in the topological sorting can be added to the subgraph. In the second part of the algorithm, the subgraph is reduced by removing all edges whose endpoint is not part of the subgraph and, in turn, all tasks and edges that have a higher index than the removed edge. Note that it is sufficient to iterate over all edges, since tasks can only be invalidated if there exists an edge dependency with a smaller index. Furthermore, it is sufficient to check all edges for which the index of the end point is larger than the maximum index in the pending subgraph. If the index is smaller, the node itself must have been checked during the first part of the algorithm and be part of the pending subgraph.
An example for the wavefront propagation is shown in Figure 7. After the 14th element is processed, the wavefront consists of two edges, at positions 15 and 16, and two tasks, at positions 17 and 18. The task at position 17 is not compatible with the subgraph and hence will conclude the first part of the algorithm before the task at position 18 can be added to the pending elements. Note that the task at position 14 is already pending, but will not be part of the maximal compressable subgraph, since it is preceded by the edge at position 12.
A maximal compressable subgraph need not be unique, even with respect to a topological base ṽ. Assume there are four nodes {a, b, c, d} with edges {(a, c), (b, d)}, such that a, b, and c are generally compatible and d is not. If the topological order of these nodes is a, b, d, c, then either the subgraph consisting of a and b or the subgraph consisting of a and c is maximal with base a. The subgraph selection becomes unique if, in addition, the subgraph is required to be connected. Let a, b be two tasks in a maximal compressable subgraph with topological base a. If b is connected to a, every node and edge that is connected to a with index between a and b must be in the subgraph as well. Now let there be another connected compressable subgraph with topological base a, but without b; then it cannot contain a task or edge with index greater than the index of b. Hence it must be a subset of the original subgraph. Since b was chosen arbitrarily, the maximal connected compressable subgraph with base a is unique.
Although uniqueness is generally desirable, we decided against a connectedness requirement for compressable subgraphs in our model. Allowing unconnected subgraphs enables the modeling of hardware parallelism on the FPGA, which can be desirable when well-streamable tasks lie in different parts of the original graph. Aside from the change to the cost function, streamability can also affect the computation cost of single tasks.
Bigger tasks that are streamable and fit on a single FPGA may greatly benefit from pipelined processing. Regarding Section 4.3, the behavior can be modeled by a streamability factor for each task, indicating into how many equal-sized pipelined steps the task can be split. With this addition to the model, if a computation node is mapped onto an FPGA and does not violate the area limitations, the execution time can be reduced by this factor.

Algorithm 3. Algorithm to find a maximal p-compressable subgraph for a device p with topological base ṽ. Starting from ṽ, a so-called wavefront, consisting of tasks and edges, is driven through the graph until one element is found that is incompatible with the current subgraph. This element is not removed from the wavefront and causes the loop to end due to the condition in line 8. Compatible tasks and edges are added to a list of pending elements in line 13 and line 16, respectively. These elements are potential candidates for the maximal p-compressable subgraph. Finally, in line 21 to 26, the last index of elements in the potential subgraph is determined, where no outgoing edge with smaller index exists that leads out of the covered index range. All elements with a higher index are removed from the pending elements in line 27.

1: function FINDMAXIMALSUBGRAPH(TopologicalSorting, Mapping, ṽ, p)
…
27:   Remove all x from Pending where Index(x) > LastIndex
28:   return Pending
29: end function
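The following Python sketch is a free reconstruction of the described wavefront procedure, with simplifications (the wavefront is advanced in plain index order and area limits are omitted); it is illustrative rather than a verbatim transcription of Algorithm 3.

def find_maximal_subgraph(sorting, mapping, base, p):
    """Sketch: maximal p-compressable subgraph with topological base `base`.

    Elements of `sorting` are tasks (v_in, v_c, v_out) or edges (u_task, v_task).
    """
    index = {x: i for i, x in enumerate(sorting)}

    def is_edge(x):
        return len(x) == 2

    def compatible(x):
        # Tasks must have their computation node mapped to p; edges are
        # provisionally accepted and filtered in the second phase.
        return is_edge(x) or mapping[x[1]] == p

    # Phase 1: expand from the base until the first incompatible element.
    pending = []
    for x in sorting[index[base]:]:
        if not compatible(x):
            break
        pending.append(x)

    # Phase 2: an edge whose target task is not pending invalidates every
    # element with a higher index (LastIndex, line 27 of Algorithm 3).
    members = set(pending)
    last_index = max(index[x] for x in pending)
    for x in pending:
        if is_edge(x) and x[1] not in members:
            last_index = min(last_index, index[x] - 1)
    return [x for x in pending if index[x] <= last_index]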

MIXED-INTEGER LINEAR PROGRAMS
The abstract model presented in Section 4.1 allows us to effectively develop and compare algorithms and heuristics for heterogeneous task assignments without regard for implementation details. In this section, we present two mixed-integer linear programs for heterogeneous task assignment based on the model.

Device-based ILP
In the first MILP, we aim to minimize the maximum time on each device. Consider a system with memory-augmented task graph G = (Ṽ, E) and devices D. For a node v ∈ V, let t_vp = t(v, p) be the time required to execute node v on device p and let d_vpq be the time required to transport the output data of node v from device p to device q (see Definition 3 in Section 4.2). In the following, we use the naming convention ṽ = (v_i, v_c, v_o) for tasks ṽ ∈ Ṽ.

FIGURE 7 Example for a wavefront as used in Algorithm 3. Circle colors indicate the mapping and labels indicate the topological order. The topological base is marked with a green label (task 4), the pending elements are marked in orange and the wavefront is indicated by red labels. The opaque orange-labelled elements (1, 2, 3) have been processed in the initialization step. The opaque black-labelled elements (6, 7) were ignored by the algorithm.
Let x_vp be a binary variable indicating that node v is mapped to device p, and let Ē = E ∪ {(v_i, v_c), (v_c, v_o) | ṽ ∈ Ṽ} be the extended set of edges in the application graph. Then the times T_p, T_p^in, T_p^out, reflecting the total time of execution on, transport to, and transport from device p, respectively, are given as

T_p = ∑_{v∈V} t_vp · x_vp,
T_p^in = ∑_{(u,v)∈Ē} ∑_{q∈D} d_uqp · x_uq · x_vp,
T_p^out = ∑_{(u,v)∈Ē} ∑_{q∈D} d_upq · x_up · x_vq.

Although many of these terms are not needed, for example, terms which are known to evaluate to zero due to missing compatibility or zero data transfer time, we keep the more general formula for the sake of simplicity. For practical usage, the superfluous terms should be removed. Note that the quadratic terms x_up · x_vq can be replaced by single variables x_upvq using the McCormick inequalities x_upvq ≤ x_up, x_upvq ≤ x_vq and x_upvq + 1 ≥ x_up + x_vq; hence the program can still be considered linear. Our goal is to minimize the term max_p(T_p + T_p^in + T_p^out). To resolve the min-max formulation, we introduce another variable z with z ≥ T_p + T_p^in + T_p^out for all p ∈ D, which is then minimized. As an additional constraint, we ensure that each node is mapped to one device. Let c_vp = c(v, p) indicate the compatibility of node v and device p as introduced in Definition 2. Then we want to guarantee that ∑_{p∈D} c_vp · x_vp = 1 for all v ∈ V. Hence our final MILP is given as

minimize z
subject to z ≥ T_p + T_p^in + T_p^out ∀p ∈ D,
∑_{p∈D} c_vp · x_vp = 1 ∀v ∈ V,
x_vp ∈ {0, 1} ∀v ∈ V, p ∈ D.

The primary intention of this model is to recognize parallelization opportunities without ignoring the impact of the heterogeneity on the data transfer between adjacent tasks. Existing ILPs for similar problems either do not consider data transfer or do not consider parallelization. The given ILP demonstrates how these factors can be combined when memory nodes are included in the mapping. However, the model still propagates a rather local view on the task graph and hence cannot guarantee that tasks that are expected to be executed in parallel are actually independent from each other. In the following section, we change the perspective to a time-centered rather than device-centered approach.
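As an illustration, the device-based program can be set up with Gurobi's Python interface roughly as follows (the paper itself uses the C++ interface); graph and time data are assumed to be given as plain dictionaries.

import gurobipy as gp
from gurobipy import GRB

def device_based_milp(V, D, E_bar, t, d, c):
    """Sketch of the device-based MILP. V: nodes, D: devices, E_bar: extended edges,
    t[v, p]: execution times, d[v, p, q]: transfer times, c[v, p]: compatibility."""
    m = gp.Model("device_based")
    x = m.addVars(V, D, vtype=GRB.BINARY, name="x")
    # McCormick variables replacing the quadratic terms x[u, p] * x[v, q].
    keys = [(u, p, v, q) for (u, v) in E_bar for p in D for q in D]
    y = m.addVars(keys, lb=0.0, ub=1.0, name="y")
    for (u, p, v, q) in keys:
        m.addConstr(y[u, p, v, q] <= x[u, p])
        m.addConstr(y[u, p, v, q] <= x[v, q])
        m.addConstr(y[u, p, v, q] + 1 >= x[u, p] + x[v, q])
    z = m.addVar(name="z")
    for p in D:
        T_p = gp.quicksum(t[v, p] * x[v, p] for v in V)
        T_in = gp.quicksum(d[u, q, p] * y[u, q, v, p] for (u, v) in E_bar for q in D)
        T_out = gp.quicksum(d[u, p, q] * y[u, p, v, q] for (u, v) in E_bar for q in D)
        m.addConstr(z >= T_p + T_in + T_out)
    for v in V:
        m.addConstr(gp.quicksum(c[v, p] * x[v, p] for p in D) == 1)
    m.setObjective(z, GRB.MINIMIZE)
    m.optimize()
    return {v: p for v in V for p in D if x[v, p].X > 0.5}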

Time-based ILP
While the above MILP is reasonably simple, it does not consider execution order and synchronization issues. In this section, we present a more exact, but also more expensive, time-based linear program. Here, the goal is to "simulate" an execution, that is, to assign start and end times to each task. For this, we introduce variables y_ṽ,0, y_ṽ,1 representing the start and end of the execution of task ṽ, including the internal memory transfer.
Using the same notation as in the previous section, we guarantee that there is sufficient time to transfer the data from and to the processing device and to execute the task between the start and the end time of a task. Hence, we demand that the end time of a task is larger than the start time plus the time for computation and internal memory transfer. The resulting constraint is given as

y_ṽ,1 ≥ y_ṽ,0 + ∑_{p,q∈D} d_{v_i pq} · x_{v_i p} · x_{v_c q} + ∑_{p∈D} t_{v_c p} · x_{v_c p} + ∑_{p,q∈D} d_{v_c pq} · x_{v_c p} · x_{v_o q}    (1)

for all tasks ṽ ∈ Ṽ. Furthermore, we assure that a task can only be started if all previous tasks have been processed and their output data was transferred. Hence,

y_ṽ,0 ≥ y_ũ,1 + ∑_{p,q∈D} d_{u_o pq} · x_{u_o p} · x_{v_i q}    (2)

for all edges (ũ, ṽ) ∈ E. In contrast to the device-based variant, we must assure that each device is used for only one task simultaneously. For this, we sort the tasks topologically and demand that all tasks that are mapped onto the same device are executed in topological order. While this can create undesirable virtual dependencies, especially regarding memory transfer, the detrimental impact can be mitigated with a well-behaved topological sorting. A sorting based on a breadth-first search is usually advisable, since with similarly sized tasks, the probability that a preceding task is not ready for a long time is reasonably small. More elaborate sorting strategies, for example, based on the task size or complexity, can reduce this probability even more. In the linear program, this constraint can be formulated as

y_ṽ,0 ≥ y_ũ,1 · x_up · x_vp    (3)

for all ũ, ṽ ∈ Ṽ with Idx(ũ) < Idx(ṽ), all u ∈ ũ, v ∈ ṽ and all p ∈ D. This equation can be linearized by replacing it with y_ṽ,0 − y_ũ,1 ≥ M · x_up · x_vp − M for all p with a sufficiently large constant M and using the McCormick inequalities as before. We minimize the maximum time z by demanding z ≥ y_ṽ,1 for all tasks ṽ ∈ Ṽ. Adding, as before, the condition that a device must be assigned to each task node, we get the final model as

minimize z
subject to constraints (1), (2), and (3),
z ≥ y_ṽ,1 ∀ṽ ∈ Ṽ,
∑_{p∈D} c_vp · x_vp = 1 ∀v ∈ V,
x_vp ∈ {0, 1}, y_ṽ,0, y_ṽ,1 ≥ 0.

In this model, many aspects of an actual execution are represented. In particular, a transitive order requirement is established, through which the program is able to decide which paths exist in the task graph and, in consequence, which of the paths have the most impact on the overall time.
Furthermore, the algorithm can effectively decide which tasks can be executed in parallel. Memory contention is partly incorporated into the linear program since data transactions are always associated with tasks. However, in general the program cannot decide whether two data transfers in different subgraphs happen simultaneously, since each node only derives information about memory transfer cost from its ancestors.
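In the same style, the additional time-based constraints (Equations 1 to 3) could be added as sketched below; x and xx denote the mapping variables and their McCormick products (assumed here to exist for all required node pairs), and big_M is an assumed upper bound on the makespan.

import gurobipy as gp
from gurobipy import GRB

def add_time_constraints(m, tasks, edges, D, t, d, x, xx, idx, big_M):
    """tasks: list of triples (v_i, v_c, v_o); edges: index pairs (producer, consumer)."""
    start = m.addVars(range(len(tasks)), name="start")
    end = m.addVars(range(len(tasks)), name="end")
    z = m.addVar(name="z")
    for k, (v_i, v_c, v_o) in enumerate(tasks):
        # Eq. (1): end >= start + internal transfers + computation time.
        m.addConstr(end[k] >= start[k]
            + gp.quicksum(d[v_i, p, q] * xx[v_i, p, v_c, q] for p in D for q in D)
            + gp.quicksum(t[v_c, p] * x[v_c, p] for p in D)
            + gp.quicksum(d[v_c, p, q] * xx[v_c, p, v_o, q] for p in D for q in D))
        m.addConstr(z >= end[k])
    for (ku, kv) in edges:
        u_o, v_i = tasks[ku][2], tasks[kv][0]
        # Eq. (2): a task starts only after its inputs have been transferred.
        m.addConstr(start[kv] >= end[ku]
            + gp.quicksum(d[u_o, p, q] * xx[u_o, p, v_i, q] for p in D for q in D))
    # Eq. (3), big-M linearization: same-device tasks run in topological order.
    for ku in range(len(tasks)):
        for kv in range(len(tasks)):
            if idx[ku] < idx[kv]:
                for u in tasks[ku]:
                    for v in tasks[kv]:
                        for p in D:
                            m.addConstr(start[kv] - end[ku] >= big_M * xx[u, p, v, p] - big_M)
    m.setObjective(z, GRB.MINIMIZE)
    return start, end, z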

Extension: Streaming devices
The models presented in the previous sections are designed with control-flow-based processing devices in mind. In this section, we elaborate on the differences with respect to dataflow processing units and extend the time-based model to exploit the additional optimization options resulting from streamability. First, since devices such as FPGAs have a maximum capacity, we must ensure that the total area requirement of the tasks added to the device does not exceed this capacity. Let a_ṽ be the area requirement for task ṽ and C_p be the capacity of a device p. Then

∑_{ṽ∈Ṽ} a_ṽ · x_{v_c p} ≤ C_p ∀p ∈ D_s,

where D_s is the set of all streaming devices. We add this constraint to both the device-based and the time-based approach to ensure a valid mapping.
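In the Gurobi sketch from above, this capacity constraint amounts to one additional line per streaming device (area, cap, and D_s being assumed inputs):

# Total area of tasks whose computation node is mapped to a streaming
# device p must not exceed its capacity C_p.
for p in D_s:
    m.addConstr(gp.quicksum(area[tv] * x[tv[1], p] for tv in tasks) <= cap[p])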
In the device-based approach, the pipelining capability cannot be represented since there is no concept of execution order implemented in the model. The time-based linear program, however, can be extended to reflect the pipelining behavior as it is modeled in Section 4.5. For this, we modify the constraints to (1) reflect the pipelining behavior during a task execution by taking the maximum of execution time and internal memory transfer times, and (2) enable tasks on streaming devices to start simultaneously with a parent task executed on the same device.
Let D = D_s ∪̇ D_t be the set of all devices, where D_s is the set of all streaming devices and D_t its complement. Then we split the task computation time into

T_ṽ,p = t_{v_c p} · x_{v_c p},
T_ṽ,p^in = ∑_{q∈D} d_{v_i qp} · x_{v_i q} · x_{v_c p},
T_ṽ,p^out = ∑_{q∈D} d_{v_c pq} · x_{v_c p} · x_{v_o q},

and replace the task computation constraint by

y_ṽ,1 ≥ y_ṽ,0 + ∑_{p∈D_t} (T_ṽ,p + T_ṽ,p^in + T_ṽ,p^out)

to account for the nonstreaming devices and

y_ṽ,1 ≥ y_ṽ,0 + ∑_{s∈D_s} T_ṽ,s and y_ṽ,1 ≥ y_ṽ,0 + ∑_{s∈D_s} T_ṽ,s^in and y_ṽ,1 ≥ y_ṽ,0 + ∑_{s∈D_s} T_ṽ,s^out

to account for the streaming devices. For computation devices without streaming capability, the constraint is equivalent to Equation (1) presented in Section 5.2. For streaming devices, the constraint is split up into three parts such that only the maximum time of the transfer from memory, the computation and the transfer back into memory is required as the difference between start and end time of the task execution.
For the incorporation of inter-task streaming, we want to assure that for consecutive streamable tasks only the maximum execution time of both tasks is required for their overall execution. Hence, we aim to effectively reduce the second constraint (Equation 2) of the time-based MILP to y_ṽ,0 ≥ y_ũ,0 if both tasks are executed on the same dataflow-based device. We achieve this by setting the inter-task transfer cost in relation to the processing devices used by the two tasks. For all edges (ũ, ṽ) ∈ E we set

y_ṽ,0 ≥ (y_ũ,1 + ∑_{p,q∈D} d_{u_o pq} · x_{u_o p} · x_{v_i q}) · (1 − ∑_{s∈D_s} x_{u_c s} · x_{v_c s}).    (4)

For streaming devices the constraint now just states y_ṽ,0 ≥ 0. To account for the inter-task transfer cost inside a streaming subgraph, we instead require the end time of task ṽ to be at least the start time of the previous task plus the transfer cost:

y_ṽ,1 ≥ y_ũ,0 + ∑_{p,q∈D} d_{u_o pq} · x_{u_o p} · x_{v_i q}.    (5)
By additionally requiring y_ṽ,0 ≥ y_ũ,0 and y_ṽ,1 ≥ y_ũ,1 for all edges (ũ, ṽ) ∈ E, the end time of task ṽ is guaranteed to be the maximum of the computation time of ũ, the transfer time between ũ and ṽ, and the computation time of ṽ. We linearize Equation (4) in two steps. In order to resolve the conditional inclusion of the end time y_ũ,1 of task ũ, we introduce a new variable y^ṽ_ũ,1, which is forced to y_ũ,1 only if no streaming occurs between task ũ and task ṽ. This condition can be enforced with a sufficiently large constant M, provided y^ṽ_ũ,1 ≥ 0. To linearize the second part of Equation (4), variables x^(1)_ṽsp and x^(2)_ṽsp can be introduced for ṽ ∈ Ṽ, s ∈ D_s, and p ∈ D with x^(1)_ṽsp = x_{v^c s} ⋅ x_{v^i p} and x^(2)_ṽsp = x_{v^c s} ⋅ x_{v^o p}. With these additions, the equation can be reformulated as a linear constraint. Naturally, the same linearization can be applied to Equation (5). Finally, we modify Equation (3) to restrict same-time use only for nonstreaming devices. Similar to before, we impose a condition on the constraint by multiplying with a streaming indicator expression.
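Under our reconstruction of Equation (4), the big-M step for the conditional end time can be written out as

\[
  y^{\tilde v}_{\tilde u,1} \;\ge\; y_{\tilde u,1} - M \cdot \sigma_{\tilde u \tilde v},
  \qquad
  y^{\tilde v}_{\tilde u,1} \;\ge\; 0,
\]

so that y^ṽ_ũ,1 ≥ y_ũ,1 whenever σ_ũṽ = 0, and is effectively unconstrained, hence free to drop to 0, whenever σ_ũṽ = 1. The indicator σ_ũṽ itself is a sum of bilinear terms and is resolved by the McCormick inequalities, as are the products x^(1)_ṽsp and x^(2)_ṽsp.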
When substituting u ∈ ũ and v ∈ ṽ, we obtain five constraints: one for the processing devices and four for the various memory combinations. All of them can be linearized analogously to the linearization proposed in Section 5.2, using the helper variables defined above. Note that for the processing devices, we can simplify the constraints by iterating exclusively over nonstreaming devices.

EVALUATION
We demonstrate the usage of the model in an early design stage through several experiments in a sample environment. We determine the execution time and data transfer time based on the specifications of the given devices and the size of a virtual data load, as described in Section 4. For the application, we generate random series-parallel graphs with 20 nodes. We start with a connected source and sink node and subsequently add edges using either a series operation (split an edge into two by adding a node on it) or a parallel operation (copy an edge). The resulting graphs are stereotypical for data-intensive applications where, starting from a common data set, the data is processed along different computation paths before the outputs are combined into a common result. During the generation, we keep track of duplicate edges and delete them as soon as 20 nodes have been generated. After removing duplicate edges, we arrive at graphs with 20 nodes and, on average, around 25 edges. Each of these nodes is then converted into a task with an input, a computation, and an output node, resulting in application graphs with, in total, 60 processing and memory nodes.
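A minimal sketch of such a generator is given below; it is our illustration, and the function and variable names are hypothetical. Starting from a single source-sink edge, it repeatedly applies a random series or parallel operation and removes duplicate edges at the end:

#include <algorithm>
#include <random>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

// Generate a random series-parallel graph with n nodes (sketch).
std::vector<Edge> generateSeriesParallel(int n, std::mt19937& rng) {
    std::vector<Edge> edges = {{0, 1}};  // source (0) connected to sink (1)
    int nodes = 2;
    std::bernoulli_distribution coin(0.5);
    while (nodes < n) {
        std::uniform_int_distribution<std::size_t> pick(0, edges.size() - 1);
        Edge e = edges[pick(rng)];
        if (coin(rng)) {                 // series: split the edge by a new node
            edges.push_back({e.first, nodes});
            edges.push_back({nodes, e.second});
            edges.erase(std::find(edges.begin(), edges.end(), e));
            ++nodes;
        } else {                         // parallel: duplicate the edge
            edges.push_back(e);
        }
    }
    // Remove duplicate edges once the target node count is reached.
    std::set<Edge> unique(edges.begin(), edges.end());
    return {unique.begin(), unique.end()};
}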
We assign the same data load to each task, so the data processing function of each task is the identity function. We choose the parallelizability p of a task uniformly between 0 and 1 and the complexity function as a linear function f(x) = cx, where the factor c is log-normally distributed with μ = 3 and σ = 0.5. These parameters are chosen to create generally similar complexities with occasional outliers of significantly higher complexity: about 90% of the generated values for c lie in the interval [10, 50], with a median of about 20. For the FPGA extension, we assume that every task is streamable and that the area needed for a task is equal to its complexity factor. The streamability factor s, that is, the possible gain through streaming, is generated randomly according to the same distribution as the complexity factor. Through this, one used unit of area is roughly equated to one pipelining step. The linear programs are solved using Gurobi 9.1.2²⁹ in C++20, compiled with g++ 9.4.0, on an AMD EPYC 7542 with 2 TB RAM.
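The parameter generation can be sketched as follows; the struct and function names are our illustration. As a sanity check, the median of the log-normal distribution is e^μ ≈ 20.1, and the central 90% of its mass falls roughly in [e^(μ−1.645σ), e^(μ+1.645σ)] ≈ [8.8, 45.7], matching the figures above:

#include <random>

struct TaskParams {
    double parallelizability;  // p, uniform in [0, 1]
    double complexity;         // c, log-normal with mu = 3, sigma = 0.5
    double streamability;      // s, same distribution as c
    double area;               // coupled to the complexity factor
};

TaskParams sampleTask(std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::lognormal_distribution<double> logn(3.0, 0.5);
    TaskParams t;
    t.parallelizability = uni(rng);
    t.complexity = logn(rng);
    t.streamability = logn(rng);  // possible gain through streaming
    t.area = t.complexity;        // one area unit ~ one pipelining step
    return t;
}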
In Section 6.1, we demonstrate the usage of the model for three different platforms. We evaluate the presented ILPs and discuss the differences based on selected mapping results. In addition, in Section 6.2, we use an OpenCL CPU-GPU framework to compare the predictions of the cost function with actual runtimes.

Model-based evaluation of the ILPs
Using our model, we compare the mapping results of the three presented linear programs for three different hardware configurations: a configuration with only CPU and GPU, a configuration with CPU, GPU, and one FPGA, and a configuration with CPU, GPU, and two identical FPGAs. Table 1 shows the average, minimum, and maximum change of performance compared to an implementation where all tasks are mapped to the CPU. For a pure CPU-GPU environment, the time-based approach is able to improve 85% of the graphs. However, the average runtime improvement is only about 2%. For our input data, mapping all tasks to the GPU makes the execution about 50% slower. Compared to the CPU, the higher parallelization factor of the GPU leads to an improvement only if close to 100% of the task is parallelizable. Consequently, potential improvements through the GPU are mainly enabled by the simultaneous execution of different tasks using uncontended memories.
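This threshold behavior follows from an Amdahl-type argument. As a hedged illustration in our model's terms, not the exact cost function of Section 4, write k_dev for the effective parallelization factor of a device and t_seq,dev for its sequential execution time of a task with parallelizable fraction p:

\[
  t_{\mathrm{dev}}(p) \;\approx\; \left( (1 - p) + \frac{p}{k_{\mathrm{dev}}} \right) \cdot t_{\mathrm{seq,dev}}.
\]

With k_GPU ≫ k_CPU but t_seq,GPU worse than t_seq,CPU, the GPU time drops below the CPU time only once the serial term (1 − p) ⋅ t_seq,GPU stops dominating, that is, for p close to 1.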
As the results show, the time-based ILP is usually more effective than the device-based ILP in increasing the performance of the execution. Both the maximum performance gain and the total number of improved mappings are higher for the time-based ILP. Adding one or two FPGAs increases the size of the design space and consequently leads to more optimization opportunities, showing potential performance gains of up to 70%. The streaming extension for the time-based approach behaves identically to the nonstreaming variant in the CPU-GPU configuration. As soon as FPGAs are available, the streaming variant consistently improves upon the standard ILP. In the 100 runs, the streaming-based approach leads to the best mapping with respect to the output of all other algorithms in 78 cases for one FPGA and 84 cases for two FPGAs, respectively. However, the additional gain comes at the cost of a significantly higher execution time, especially if more than one streaming device is available.

An exemplary mapping of the three algorithms is shown in Figure 8. All of the depicted mappings improve on a pure, nonheterogeneous CPU mapping, which has a cost of 106 s, compared to 55 s to 85 s as indicated in the final nodes of the task graphs. The device-based approach tries to balance the computation time between the three available devices. For this, it tends to put well-parallelizable tasks on the GPU and badly parallelizable nodes with good streamability on the FPGA. However, it is not able to recognize which of these nodes lie on the critical path of the task graph and therefore fails to ensure that the best devices are used for the associated tasks. The time-based approach is able to identify the critical path and therefore uses the better-performing CPU for the last two nodes before the sink. Furthermore, it utilizes the FPGA for the first node after the source. By this, the transfer time to the uncritical GPU path on the left is increased in order to reduce the cost of the critical path and therefore the overall execution time. However, there are cases in which the device-based ILP finds a better mapping, since it is not restricted to a specific topological order. Furthermore, it is less complex to solve and may therefore be better suited for very large task graphs.

TABLE 1: Performance gain through task assignment strategies compared to assigning all tasks to the CPU for 100 graphs with 20 tasks. The fourth column indicates the number of cases in which the performance could be improved. The execution time of the optimization algorithm is given in the last column.
With the streaming extension, the time-based approach can utilize FPGAs not only as single-task performance accelerators, but also to accelerate sequences of tasks. In the mapping shown in Figure 8B, a CPU task blocks the streaming capabilities of its surrounding FPGA tasks. The streaming approach in Figure 8C recognizes this situation and instead shifts the CPU to the independent path on the left, replacing the GPU. With this, all FPGA nodes can be executed in parallel, leading to a significantly reduced overall execution time.
In the example shown in Figure 8, the transfer cost between different memories has only a small impact on the mapping. This changes drastically if the complexity of the computations is reduced. When reducing the complexity by one order of magnitude for all tasks, switching devices becomes much more costly compared to the computation itself. In this case, in each of the hardware configurations, only about 30 to 40 out of 100 graphs with 20 tasks could be improved using the time-based algorithm, and only up to 6 out of 100 graphs with the device-based ILP. Furthermore, the tendency to map multiple connected tasks to the same device increases strongly. In the best case, improvements of up to 20% compared to a CPU implementation are reached. Since in our model the area requirement is coupled to the complexity of a task, for small graphs all tasks can be mapped to the FPGA.
The streaming extension for the time-based approach recognizes this, leading to a significant time reduction in almost all cases. When keeping the original area requirements, the ILP with streaming extension behaves similarly to the standard time-based ILP. Figure 9 gives an impression of the complexity of the linear programs. While the time-based ILPs show fast exponential growth, the device-based mapping remains feasible for significantly larger graphs. The execution time may vary significantly based on the graph structure, hence the results show strong variation even when averaging over 10 graphs.

Runtime evaluation in OpenCL
In order to compare the predictions made by our model with the real-world behavior of a heterogeneous system, we implemented an OpenCL 2.2 CPU-GPU framework that allows for the execution of arbitrary application graphs, given a mapping and the respective kernel functions. Our aim is to evaluate whether the abstract model and cost function presented in Section 4.2 reflect the behavior of a real system. Finding a well-suited realization of the model for a specific environment, as motivated in Section 4.3, is a challenging problem on its own, but orthogonal to the aforementioned question. We furthermore refrain from a practical implementation of an FPGA system, since this would require extensive discussion beyond the scope of this paper. Hence, both finding a generalized approach to creating meaningful realizations and a thorough evaluation of the FPGA part of the model are subject to future research.
For the test system, we create dummy kernels performing a varying number of additions in the finite group ℤ₄₇, that is, additions followed by a modulo operation (see Algorithm 4). One complexity unit corresponds to 100 operations. Our input consists of an array of 2¹⁸ unsigned integer values, resulting in a total input size of 1 MB. Generally, we assign one work item to each array value, making the kernel fully parallelizable. We simulate different degrees of parallelizability by explicitly forcing a certain percentage of the operations to be executed on the whole array by a single work item. Note that in real applications, the parallelizability of a task is significantly harder to determine. However, while dealing with these issues requires a more complex realization approach, this simplification does not affect the validity of the results with respect to the abstract model.
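The arithmetic of such a dummy kernel can be sketched in plain C++ as a host-side emulation of one work item; this is our illustration, while Algorithm 4 shows the actual generated OpenCL kernel. Note that 2¹⁸ values × 4 bytes per unsigned integer yields the stated 1 MB input:

#include <cstdint>
#include <vector>

// Emulates one work item: 'complexity' units of 100 additions each in the
// finite group Z_47 (an addition followed by a modulo operation).
std::uint32_t workItem(std::uint32_t value, int complexity) {
    std::uint32_t acc = value % 47;
    for (int unit = 0; unit < complexity; ++unit)
        for (int op = 0; op < 100; ++op)   // one complexity unit = 100 operations
            acc = (acc + value) % 47;
    return acc;
}

// Fully parallelizable case: one work item per array element.
void runKernel(std::vector<std::uint32_t>& data, int complexity) {
    for (auto& v : data) v = workItem(v, complexity);
}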
As described in the previous section, the GPU will only improve on the CPU if the parallelizability of a task is close to 100%. In order to get more meaningful mappings, we modify the generation of random graphs such that with a probability of 50% the generated task is fully parallelizable, while in the other cases the parallelizability is chosen uniformly at random. For the realization, we use the model from the previous section and refine it through measurements, similar to the idea of a utility matrix described at the end of Section 4.3. We experimentally observed that one iteration of the kernel loop takes about 9 clock cycles per data point on the CPU and about 22 cycles on the GPU, leading to a factor of 900 or 2200 cycles, respectively, per complexity unit.
The results of the evaluation of 100 randomly generated graphs with 20 tasks are shown in Table 2. The average absolute deviation between the predicted and the measured execution times is around 10%. In the presented data, the model tends to underestimate the efficiency of the mappings found by the two ILPs. For the development of efficient algorithms, it is especially important to predict whether a certain mapping is better or worse than another one. The last column shows the number of cases in which the found mapping was either better than the pure CPU mapping for both the predicted and the measured execution time, or worse in both cases, that is, where the predicted order matches the actual order. With 81% correct predictions for the device-based ILP and 73% correct predictions for the time-based ILP, the model is already feasible for a rough comparison. However, the prediction can still be improved.
TABLE 2: Comparison of predicted and measured runtimes for the mappings found by the device-based and the time-based ILP. The results are based on 100 randomly generated task graphs with 20 nodes each. Improvements are computed in relation to a pure CPU mapping. Generally, the predictions underestimate the improvements made by the algorithms. The average deviation is computed as the arithmetic mean of the absolute values of the relative deviations. The correct predictions show the number of cases in which it was correctly predicted whether the found mapping improved on a pure CPU mapping.

FIGURE 10: Comparison of the predicted and actual timing of a graph with 10 task nodes and a mapping derived by the device-based ILP. The prediction underestimates the actual runtime since it uses a better scheduling than the actual execution in the test system.

There are two factors that lead to systematic errors in the prediction. First, the error rate for predicting the pure CPU mapping is relatively high at about 9.5%. This is caused by sporadic background processes on the test system, leading to thread switches that cause a constant delay of about 100 ms for each task. To correct for this, we run each process three times and take the minimum value of these runs. Second, most large deviations are caused by a suboptimal predicted execution order. The test program enqueues each task as soon as all of its dependencies are fulfilled, whereas in the model we assume a topological order based on a breadth-first search. As suggested in Section 4.2, we adapt to this by choosing the topological order partly at random. In addition to the breadth-first-search topological order, we randomly generate 100 topological orders. We then take the minimum cost of all computed orders as our prediction. Even without controlling for duplicates, this approach is reasonable due to the linear complexity of the cost function, taking about 2 ms on our test system.
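Random topological orders can be generated with a randomized variant of Kahn's algorithm, as sketched below; this is our illustration, and the prediction is then simply the minimum cost-function value over the breadth-first-search order and all generated random orders:

#include <random>
#include <vector>

// One random topological order of a DAG given as adjacency lists (sketch).
std::vector<int> randomTopologicalOrder(const std::vector<std::vector<int>>& adj,
                                        std::mt19937& rng) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> indegree(n, 0);
    for (const auto& out : adj)
        for (int v : out) ++indegree[v];

    std::vector<int> ready, order;
    for (int v = 0; v < n; ++v)
        if (indegree[v] == 0) ready.push_back(v);

    while (!ready.empty()) {
        // Pick a random ready node instead of the breadth-first front.
        std::uniform_int_distribution<std::size_t> pick(0, ready.size() - 1);
        std::size_t i = pick(rng);
        int u = ready[i];
        ready[i] = ready.back();
        ready.pop_back();
        order.push_back(u);
        for (int v : adj[u])
            if (--indegree[v] == 0) ready.push_back(v);
    }
    return order;  // valid topological order with randomized tie-breaking
}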
Table 3 contains the measurements taken with the two improvements described above. The error rate for a pure CPU mapping dropped to around 3%, providing a significantly better reference for comparison. By improving the ordering, especially the gains of the time-based approach could be captured much better, increasing the number of correct predictions from 73% to 87%. Note that, while the prediction previously tended to underestimate the gains of the new mappings, it now tends to overestimate them. The reason is that the randomized approach tends to find even better execution orders than the greedy first-come, first-served order enforced by the test system. This observation hints at another advantage of the model-based evaluation: depending on the application, the fast evaluation algorithm enables developers to rapidly search through many potential schedules to decide which of them leads to the lowest execution time. One of these cases is depicted in Figure 10. Here, the prediction uses a topological order in which the two GPU nodes of the right path are evaluated first, enabling an early execution of the expensive node that is mapped to the CPU. In the actual execution, two nodes on the left path are additionally executed before the CPU node, thereby delaying the execution of the critical path by about 450 ms.
TABLE 3: Comparison of predicted and measured runtimes for the mappings found by the presented ILPs, analogously to Table 2, with two modifications. First, the predicted runtimes are taken as the minimum of the predictions based on 100 random topological orders and the prediction based on a breadth-first-search topological order. Second, the measured runtimes are taken as the minimum over three runs. The predictions slightly overestimate the improvements. The average deviation is significantly lower, while the number of correct predictions is significantly higher than for the results from Table 2.

FIGURE 11: Prediction rate and average deviation for different graph sizes. The test settings are identical to the settings used in Table 3. The prediction rates lie between 86% and 95%, whereas the measured average deviations lie between 4% and 10%.

The results shown above for 20 task nodes are generally representative for other graph sizes. In Figure 11, prediction rates and deviations are shown for graph sizes between 5 and 30 nodes. For both the prediction rate and the average deviation, the results are consistent over the various graph sizes. A clear correlation between the size of the task graph and the measured values is not observable, although more data points would be required for a full evaluation of this behavior.
All tests were executed with enforced sequential execution of the OpenCL queues. If parallel execution is enabled, the CPU can make use of multiple cores while executing multiple badly parallelizable tasks. The evaluated model does not reflect this behavior, which may lead to significant deviations if an application consists of many independent, but inherently sequential, tasks. With our test data, enabling out-of-order execution leads to an average deviation of about 33% for a CPU implementation, as well as average deviations of 7% and 13% for the device-based and time-based approach, respectively. Since the additional error mostly depends on the structure of the application graph, it tends to affect all three mappings proportionally. Consequently, the prediction rate remains high at around 80% for both ILPs. As part of future work, a technique similar to the one described for databus usage in Section 4.4 may be used to reflect partial utilization of the CPU and incorporate this behavior into the model.

CONCLUSION
The model presented in this work provides a solid basis for the development and the abstract analysis of general task mapping algorithms. A common model allows the designer to use various heuristics to explore the design space for potential improvements early in the design process. In particular, a large database of available algorithms helps in deciding early on whether a potential optimization is worth the effort. The special focus on dataflow processing units makes the model especially viable for hardware/software co-design, for example in the context of heterogeneous MPSoCs.
The realization of the model in the different design stages currently still places much responsibility on the designer. The modeling of the computation times and data transfer times assessed in Section 4.3 and used in our experiments provides a direction on how the model can be used. More precise realizations can easily be integrated into the presented model structure; their development is open for future research. A central research question with respect to the presented cost function lies in the evaluation of different topological sortings. While the breadth-first search used here did lead to good results, large differences in task complexities may result in unexpected behavior.

ALGORITHM 2: Computation of the total cost of a given task-device mapping (function DETERMINECOST(TopologicalSorting, Mapping)) under consideration of streaming behavior. It extends Algorithm 1 by a third case, in which x represents a subgraph (lines 9 to 12); here, the time values of the respective devices are increased by the computation cost of the whole subgraph, and the total cost is returned as max({time(p) | p ∈ Devices}).

FIGURE 9: Execution time of the three ILPs for a configuration with one CPU, one GPU, and one FPGA and varying graph size. Results are averaged over 10 graphs. The execution time is shown on a logarithmic scale.

FIGURE 8: Mappings found for a small sample graph by the device-based and the time-based ILP with and without streaming extension. The first two lines in each node indicate the computed mapping. If input and output memory are identical, only one is shown. The corresponding memory nodes are omitted for readability. At the edges and in the nodes, the time windows for transport and computation are annotated. Furthermore, in each node the parallelizability p, the complexity factor c, and the streamability s are given.
3. Our test system contains an AMD Epyc 7351P with 16 cores (32 threads), a clock rate of 2.9 GHz, and SIMD processing with 8 × 32 B words, as well as an AMD Radeon RX Vega 56 with 1.5 GHz and 3584 SIMD units. Furthermore, we assume a Xilinx XCZ7045 FPGA with a clock rate of 400 MHz and an equivalent of 350k logic cells, partitioned into 128 area units. We assume appropriate RAM units for CPU, GPU, and FPGA with a calculated throughput of 170, 410, and 11 GB/s, respectively.

ALGORITHM 4: Exemplary test kernel as generated for a task with a complexity of 1, a parallelizability of 60%, and two incoming edges. Lines 9 to 16 contain the parallelizable part of the kernel, whereas lines 21 to 32 contain the serial part. The parallelizability and complexity associated with a kernel affect the number of executions of the innermost for loops. For each input parameter, one group operation is executed in the loop body. In each of the two parts, the input and output data is read or written, respectively, only once.