A generic parallel pattern interface for stream and data processing

Current parallel programming frameworks aid developers to a great extent in implementing applications that exploit parallel hardware resources. Nevertheless, developers require additional expertise to properly use and tune them to operate efficiently on specific parallel platforms. On the other hand, porting applications between different parallel programming models and platforms is not straightforward and demands considerable effort and specific knowledge. Apart from that, the lack of high-level parallel pattern abstractions in those frameworks further increases the complexity of developing parallel applications. To pave the way in this direction, this paper proposes GRPPI, a generic and reusable parallel pattern interface for both stream processing and data-intensive C++ applications. GRPPI accommodates a layer between developers and existing parallel programming frameworks targeting multi-core processors, such as C++ threads, OpenMP and Intel TBB, and accelerators, such as CUDA Thrust. Furthermore, thanks to its high-level C++ application programming interface and pattern composability features, GRPPI allows users to easily expose parallelism via standalone patterns or pattern compositions matching those found in sequential applications. We evaluate this interface using an image processing use case and demonstrate its benefits from the usability, flexibility, and performance points of view. Furthermore, we analyze the impact of using stream and data pattern compositions on CPUs, GPUs, and heterogeneous configurations.

In this paper, we propose GRPPI, a generic and reusable parallel pattern interface that acts as a layer between users and existing parallel programming frameworks targeted to multi-core and heterogeneous platforms. Basically, GRPPI allows users to implement parallel applications without having a deep understanding of existing parallel programming frameworks or third-party interfaces and, thus, reduces development and maintenance efforts. In contrast to other object-oriented implementations in the literature, we use C++ template metaprogramming techniques in order to provide generic interfaces of the patterns without incurring significant runtime overheads. Specifically, we contribute in this paper with the following:
• We present a generic, reusable set of parallel patterns for the C++ language that interfaces with different parallel programming frameworks targeted to multi-core processors, such as C++ threads, OpenMP and Intel TBB, and accelerators, such as CUDA Thrust.
• We provide support for stream parallel (Pipeline, Farm, Filter and Accumulator) and data parallel (Map, Reduce, Stencil, MapReduce and Divide&Conquer) patterns.
• We show the flexibility and the composability of GRPPI for both stream and data patterns, and their combination through diverse simple examples.
• We evaluate the overheads introduced by the interface using a real-world image processing application with regard to other pattern-based parallel frameworks and runtime environments.
• We analyze the impact of using different stream and data pattern compositions on CPUs, GPUs, and heterogeneous configurations for the aforementioned use case.
In general, this paper extends the results presented in 9 with i) the inclusion of data parallel patterns in GRPPI, ii) the support for accelerators, and iii) the pattern composition analysis along with its evaluation on both multi-core and heterogeneous platforms.
The remainder of this paper is organized as follows. Section 2 gives a brief overview of related work in the area. Section 3 states the formal definition of the stream and data parallel patterns supported by the interface. Section 4 describes the generic parallel pattern interface presented in this contribution. Section 5 evaluates the overheads of the interface on several parallel programming frameworks and analyzes the pattern composition benefits on different platform configurations. Section 6 provides a few concluding remarks and outlines future work.

RELATED WORK
Multiple works proposing patterns for developing applications targeted to run on modern architectures can be found in the state of the art. Indeed, pattern programming has become one of the best codifying practices in software engineering. 10 The reason is clear: patterns simplify the application structure while achieving a good balance between maintainability and portability. In this sense, one of the most common ways to express parallelism is via parallel skeletons or patterns. 11 These patterns can be classified into 2 main categories: data parallel patterns, eg, Map, Reduce or MapReduce; and stream parallel patterns, eg, Pipeline, Farm or Filter. 12 Most of the existing pattern-based frameworks in the literature are oriented to data-parallel computing. Focusing on implementations targeted to run on multi-core processors, we find solutions such as ArBB 13 and Kanga. 14 ArBB defines a collection of basic data classes and member functions to define data-parallel skeletons, which can be used with alternative front-ends. The Kanga framework also supports task-parallel skeletons; nevertheless, it lacks stream-processing patterns. We can also find frameworks that implement data-parallel patterns.

PARALLEL PATTERNS
Patterns can be loosely defined as commonly recurring strategies for dealing with particular problems. This methodology has been widely used in multiple areas, such as architecture, object-oriented programming, and software architecture. 20 In our case, we leverage patterns for parallel software design, as it has been recognized to be one of the best codifying practices. 10

Stream patterns
In this section, we describe formally the stream parallel patterns Pipeline, Farm, Filter, and Accumulator included in GRPPI.
Pipeline This pattern processes the items appearing on the input stream in several parallel stages (see Figure 1A). Each stage of this pattern processes data produced by the previous stage in the pipe and delivers results to the next one. Provided that the i-th stage in an n-staged Pipeline computes the function f i ∶ α i−1 → α i , the Pipeline delivers, for each input item x i , the result of the function composition f n ( … f 2 (f 1 (x i )) … ) to the output stream. The main requirement of this pattern is that the functions related to the stages should be pure, ie, they can be computed in parallel without side effects.
Farm This pattern computes in parallel the function f ∶ α → β over all the items appearing in the input stream (see Figure 1B). Thus, for each item x i on the input stream, the Farm pattern delivers the item f(x i ) to the output stream. In this pattern, the computations performed by f for the items in the input stream should be completely independent of each other; otherwise, they cannot be processed in parallel.
Filter This pattern computes in parallel a filter over the items appearing on the input stream, passing to the output stream only those items satisfying the boolean "filter" function (or predicate) P ∶ α → {true, false} (see Figure 1C). Basically, the pattern receives a sequence of input items … , x i+1 , x i , x i−1 , … and produces a sequence of output items of the same type but with a different cardinality. The evaluation of the filtering function on an input item should be independent of any other, ie, the predicate should be a pure function.
Accumulator This pattern collapses items appearing on the input stream and delivers these results to the output stream (see Figure 1D). The function ⊕ used to collapse item values should be a pure binary function of type ⊕ ∶ α × α → α, usually associative and commutative. Basically, the pattern computes the function ⊕ over a finite sequence of input items to produce a collapsed item on the output stream. The number of elements to be accumulated depends on the window size set as a parameter.

Data patterns
In this section, we describe formally the data parallel patterns Map, Reduce, Stencil, MapReduce, and Divide&Conquer included in GRPPI.
Map This pattern applies a function f ∶ α → β to each element x i of the input data collection, producing the output collection whose i-th element is y i = f(x i ) (see Figure 2A). The main constraint of this pattern is that the function f should be pure.
Reduce This data parallel pattern aggregates the elements of the input data collection of type α using the binary function ⊕ ∶ α × α → α, which is usually associative and commutative. Finally, the result of the pattern is summarized in a single element y of type α obtained by performing the operation y = x 1 ⊕ x 2 ⊕ … ⊕ x n , where x i is the i-th data item of the input data collection (see Figure 2B). The main constraint of this pattern is that the binary function should be pure.
Stencil This pattern is a generalization of the Map pattern in which the elemental function can access not only a single element of the input collection but also a set of its neighbors (see Figure 2C).

A GENERIC AND REUSABLE PARALLEL PATTERN INTERFACE
In this section, we introduce our generic and reusable parallel pattern interface (GRPPI) for C++ applications. GRPPI takes full advantage of modern C++ features, metaprogramming concepts, and generic programming to act as a switch between the parallel programming models OpenMP, C++ threads, Intel TBB, and CUDA Thrust. Its design allows users to leverage the aforementioned execution frameworks through a single and compact interface, hiding away the complexity behind the use of concurrency mechanisms. Furthermore, the modularity of GRPPI makes it easy to integrate new patterns and to combine them to arrange more complex ones. Thanks to this property, GRPPI can be used to implement a wide range of existing stream-processing and data-intensive applications with relatively small effort, resulting in portable codes that can be executed on multiple frameworks.
Next, we describe in detail the interfaces of the parallel patterns offered by GRPPI and demonstrate its composability through different simple examples.

4.1 Description of the interfaces

GRPPI offers both stream patterns and data patterns with a single interface carefully designed to allow composability and to support multiple implementation back-ends.

Stream patterns
The GRPPI stream parallel patterns comprise the Pipeline, Farm, Filter, and Accumulator patterns.
Pipeline The GRPPI interface designed for the Pipeline pattern receives the execution model and the functions (in and stages) related to its stages. As can be seen in Listing 1, its C++ interface uses templates, making it more flexible and reusable for any data type. Note as well the use of variadic templates, allowing a Pipeline to have an arbitrary number of stages by receiving a collection of callable objects passed as arguments. In GRPPI, the parallel implementation of this pattern is performed using a set of concurrent entities, each of them taking care of a single stage. This is controlled via the execution model parameter, which can be set to operate sequentially or in parallel through the different supported frameworks; eg, to use OpenMP, the parameter should be set to parallel_execution_omp.
Farm In a similar way, the Farm pattern interface, shown in Listing 2, receives the execution model and 3 functions (in, farm and out) that are in charge of (i) consuming the items from the input stream, (ii) processing them individually, and (iii) delivering the results to the output stream. Note that the farm function will be executed in parallel by the different concurrent entities.
In this case, the execution model can optionally receive, as an argument, the number of entities to be used for the parallel execution, eg, parallel_execution_omp{6} uses 6 OpenMP worker threads. If this argument is not given, the interface takes by default the number of threads set by the underlying platform.
Filter The interface for the Filter pattern, described in Listing 3, receives the execution model argument, followed by a stream consumer (in), filter (filter) and producer (out) functions.
Specifically, the in function reads items from the input stream and forwards them to the filter function, which is responsible for determining whether an item should be accepted or not. Afterwards, those items that satisfy the filtering routine are received by the out function, which delivers them to the output stream.
Note that the filter function must return a boolean value. The parallel implementation of this pattern applies the filter function using a set of concurrent entities, which can be configured through the execution model parameter.
Accumulator The Accumulator pattern aims at reducing, using a specific binary function, the items appearing on the input stream.

Data patterns
This section describes in detail the interfaces for the data parallel patterns offered by GRPPI.

Pattern composability
As mentioned in the introduction, the patterns offered by GRPPI can be composed with one another to produce more complex structures and to match specific constructions present in both stream and data parallel applications. To demonstrate this feature, we describe 3 examples of pattern composability tackling each of the feasible combinations of computational paradigms (stream and data) supported by the GRPPI interface: stream-stream, data-data, and stream-data compositions.
For the stream-stream pattern composability, the code in Listing 10 implements a Pipeline in which the second stage is a Farm pattern.
As mentioned, we can also compose stream with data patterns. This is a feasible composition, given that the items coming from a stream can themselves be processed using a data parallel pattern. The opposite is, however, not feasible because the results generated by a data pattern cannot be transformed into streams and, therefore, processed using a stream processing approach.
Irreducible This category is a feasible composition providing a useful parallel pattern that cannot be simplified any further. Note that pattern compositions falling in this category are natively supported by GRPPI.

Useful-Reducible This category is a feasible composition implementing a pattern composition that can be simplified further but that, in some cases, provides clearer and more readable code than its simpler equivalent.
As shown in Table 1a, the stream-stream pattern compositions involving a Pipeline and another pattern are classified as Irreducible. Focusing on data-data compositions, when Reduce acts as the outer pattern, it cannot be combined with any other inner one; the reasons are the same as those for the Accumulator pattern in stream-stream compositions. Other compositions whose outer pattern is MapReduce or Divide&Conquer are classified as Feasible, as they can be implemented in GRPPI although they do not bring any major advantage.
Finally, stream-data compositions are summarized in Table 1c. Compositions whose outer pattern is Pipeline or Farm are denoted as Irreducible. The combination of 2 distinct parallel paradigms (stream-data) makes these compositions unique and precludes them from being simplified further.

EVALUATION
In this section, we perform an experimental evaluation of GRPPI in order to analyze its usability, in terms of lines of code, and its performance, in comparison with the different parallel execution environments currently supported. To do so, we use, among others, the following Pipeline compositions: a Pipeline composed of a Farm in its second stage ( p | f | p | p ); a Pipeline composed of a Farm in its third stage ( p | p | f | p ); and a Pipeline composed of 2 Farm patterns in the second and third stages ( p | f | f | p ). Figure 3 illustrates some of the compositions used in these studies.

Analysis of the usability
In this section, we analyze the usability and flexibility of the developed interface. To analyze these aspects, we compare the number of lines of code required to implement the parallel version of the application using GRPPI with respect to using the parallel execution frameworks directly. Additionally, switching GRPPI to a particular execution framework requires changing only a single argument in the pattern function calls.

Performance analysis of pattern compositions
Next, we analyze the performance with and without GRPPI using the different execution frameworks and Pipeline compositions for the video application. Concretely, we employ the frames-per-second (FPS) metric to analyze the behavior of the particular versions using the same input video at diverse resolutions. Also, we set the Farm stage(s) in all Pipeline compositions to be executed in parallel by 6 threads for all the execution models. The results are shown in Figure 4.

Performance analysis of stream vs data patterns
Our next analysis compares the performance among Pipeline compositions that combine stream and data parallel patterns. The results are reported in Figure 5.

Performance analysis on heterogeneous configurations
Our last experiment analyzes the performance of a stream-data Pipeline composition with different heterogeneous configurations (CPU+GPU). Figure 6 illustrates the FPS delivered by the Pipeline composed of 2 Stencil stages that are mapped in different ways to the devices available on the platform.

CONCLUSIONS
In this paper, we have presented a generic and reusable parallel pattern interface, namely GRPPI, which leverages modern C++ features, metaprogramming concepts, and template-based programming to act as a switch between parallel programming models. Its compact design facilitates the development of parallel applications, hiding away the complexity behind the use of concurrency mechanisms. In this version of the interface, we target both stream and data parallel patterns and demonstrate the interface's flexibility by composing them in a series of simple examples. We also support frameworks targeted to multi-core processors (C++ threads, OpenMP and Intel TBB) and accelerators (CUDA Thrust).
Therefore, given that many general purpose sequential applications can be decomposed in several design patterns, this interface can be easily introduced in such applications to parallelize them and improve their performance.
As observed throughout the evaluation with a parallel video application, the performance attained by each parallel pattern combination is almost the same whether the supported frameworks are used directly or through GRPPI. We prove as well that our approach does not introduce considerable overheads while allowing applications to be easily parallelized by adding, on average, only 4.4% more lines of code. We also demonstrate that, depending on the application and the devices available, different stream-data compositions and frameworks may lead to better performance figures. In a nutshell, GRPPI advocates for a usable, simple, generic, and high-level parallel pattern interface, allowing users to implement parallel applications without having a deep understanding of existing parallel programming frameworks or third-party interfaces.
As future work, we plan to extend GRPPI to support more complex parallel patterns, such as windowed and keyed stream farms, stream iteration, and the Stencil-Reduce 21 patterns. Furthermore, we intend to include other execution environments among the offered parallel frameworks, eg, FastFlow, SkePU, and OpenCL SYCL. An ultimate goal is to incorporate scheduling techniques able to map task threads to CPU cores and GPUs and to manage data transfers between host and device.