Designing and building application‐centric parallel memories

Memory bandwidth is a critical performance factor for many applications and architectures. Intuitively, a parallel memory could be a good solution for any bandwidth‐limited application, yet building application‐centric custom parallel memories remains a challenge. In this work, we present a comprehensive approach to tackle this challenge and demonstrate how to systematically design and implement application‐centric parallel memories. Specifically, our approach (1) analyzes the application memory access traces to extract parallel accesses, (2) configures our parallel memory for maximum performance, and (3) builds the actual application‐centric memory system. We further provide a simple performance prediction model for the constructed memory system. We evaluate our approach with two sets of experiments. First, we demonstrate how our parallel memories provide performance benefits for a broad range of memory access patterns. Second, we prove the feasibility of our approach and validate our performance model by implementing and benchmarking the designed parallel memories using FPGA hardware and a sparse version of the STREAM benchmark.

Customizing parallel memories. Our research focuses on the mapping of the access trace from the application to the parallel access patterns of the parallel memory. We take the validation one step further and benchmark the implementation of the parallel memories designed and built for 19 Sparse STREAM instances: the original (dense) and 18 variants with various sparsity levels (Section 6). We demonstrate how our approach enables a seamless analysis and implementation of these 19 accelerators in hardware (using a Maxeler FPGA board). Our performance measurements (in terms of bandwidth) confirm (1) the expected performance gain in terms of speedup and (2) the accuracy of the model-based estimates for the achieved bandwidth.
In summary, our contribution in this paper, which extends our previous works, 2,3 is five-fold.
• We present a methodology to analyze and transform application access traces into a (close-to-) optimal sequence of parallel-memory accesses.
• We provide a systematic approach to optimally configure a polymorphic parallel memory (eg, PolyMem) and schedule the set of memory accesses to maximize the performance of the resulting memory system.
• We provide a statistical analysis to determine the performance potential that our parallel memory offers to applications with various memory access patterns.
• We define and validate a model that predicts the performance of our parallel-memory system.
• We present empirical evidence that the designs generated using our approach can be implemented in hardware as parallel-memory accelerators, delivering the predicted performance.

Parallel memories
Parallel memory.
A parallel memory (PM) is a memory that enables the access to multiple data elements in parallel.
A parallel memory can be implemented using a set of independent memories, referred to as sequential memories or memory banks. The width of the parallel memory is equal to the number of banks used and determines the number of elements that can be accessed in parallel. The capacity of the PM indicates the amount of data that it can store.
A location in a PM is a combination of a memory bank identifier and an in-memory address, which specifies where within a bank a data element is to be found/stored. Formally, the parallel-memory location of a data element is loc = (bank, address). This work addresses non-redundant parallel memories. These PMs rely on bijective functions to map the coordinates of an application element to its memory location. Non-redundant PMs guarantee data consistency and the complete use of the capacity of their banks by avoiding data replication. However, they restrict the possible parallel accesses: only elements located in different banks can be accessed in parallel (see Section 2.2).

The application
The term application refers to the entity using the PM to access data, eg, a hardware component connected to the PM or a software application interfaced with the PM.
Without loss of generality, we will consider the data of an application to be stored in an array A of N dimensions. Each data element can then be identified by its coordinates in A. A concurrent access is a set of P such elements, where P is the number of elements that the application can access concurrently at a given step of the computation. An application memory access trace is a temporal series of concurrent accesses. Finally, a parallel-memory access is an access to multiple data elements which actually happens in parallel.
Ideally, to maximize the performance of an application, any concurrent access should be performed as a parallel access, happening in one memory cycle. However, when the size of a concurrent access (P) is larger than the width of the PM (M), a scheduling step is required, to schedule all P accesses on the M memories. Our goal is to systematically minimize the number of parallel accesses for each concurrent access in the application trace. We do so by tweaking both the memory configuration and the scheduling itself. To map the access to an element in application space to a parallel access in PM space, we need to define a mapping function that guarantees M-wide conflict-free accesses. Determining the function to use is a key challenge in defining a custom parallel memory.

Parallel-memory configuration
Memory mapping function.
The Memory Mapping Function (MMF) maps an application memory access to its parallel-memory location, ie, its address on a given data bank: loc = MMF(i0, … , iN−1) = (bank, address), where i0, … , iN−1 are the coordinates of the access in the application space, M is the width of the parallel memory, and D[I] are the sizes of each dimension of the application space array.
We note that due to the restriction that only conflict-free accesses can be parallel accesses, there is a limited set of access patterns that a parallel memory can support. These patterns are an immediate consequence of the MMF.
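As a concrete illustration, a minimal interleaved MMF for a 2D row-major array could be sketched as below. This is a hypothetical example, not PolyMem's actual MMF; the function name and the interleaving choice are our own, used only to show the (bank, address) decomposition:

```python
def mmf(i, j, M, D):
    """Map 2D application coordinates (i, j) to a (bank, address) pair.

    M is the parallel-memory width (number of banks);
    D = (rows, cols) are the sizes of the application array.
    """
    linear = i * D[1] + j           # row-major linearization
    return linear % M, linear // M  # (bank id, in-bank address)

# With this MMF, M consecutive elements of a row fall in distinct banks,
# so a row-wise access of width M is conflict-free.
banks = [mmf(0, j, 8, (8, 8))[0] for j in range(8)]
```

Because the function is bijective over the array, every element gets a unique location and no bank capacity is wasted, matching the non-redundancy property discussed above.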

PM configuration.
A PM configuration is the pair (MMF, C), where MMF is a mapping function and C is the capacity of the PM.
Customizing a parallel memory entails finding, for a given application, the PM configuration that is able to retrieve the elements at the position specified in the application memory access trace using the minimum number of conflict-free parallel accesses, hence maximizing the bandwidth of the memory system. In the remainder of this paper, we focus on a methodology to customize parallel memories with the right M, C, and MMF for a given application, thus building application-centric parallel memories (see Section 3 and further).

Scheduling concurrent accesses
Once the parallel-memory configuration is known, a transformation between the application concurrent accesses and the memory parallel accesses is necessary. We call this transformation scheduling, and note it can be static, ie, computed pre-runtime, per concurrent access, or dynamic, ie, computed at runtime. In this work, we assume static scheduling is possible, and the actual schedule is an outcome of our methodology (see Section 3 and further).

METHODOLOGY
In this section, we describe two approaches for scheduling an application access trace onto a set of PM parallel access patterns: (1) a solution to determine the minimum number of PM accesses that cover the application access trace using an ILP formulation, and (2) a faster alternative, in the form of a heuristic method which trades off optimality for speed. We conclude this section by presenting our full methodology for building application-centric parallel memories, and a simple predictive model to calculate the performance of the resulting memory system.

From concurrent accesses to set covering
We express the problem of scheduling an application access trace onto a set of PM accesses as a particular instance of set covering, an NP-complete problem. 4
Set covering.
Given a universe U of n elements, a collection of sets S = {S1, … , Sk}, with Si ⊆ U, and a cost function c : S → Q+, find a minimum-cost subset of S that covers all elements of U.
Optimally solving the set covering 4 problem requires it to be formulated as an Integer Linear Program (ILP), where a binary variable xi indicates whether set Si is part of the solution, c(Si) is the cost of set Si, and the solution is constrained to include, for each element e ∈ U, at least one set that covers it. An optimal schedule of an application access trace on a set of PM parallel accesses can be found by reducing this problem to a set covering one and leveraging the ILP formulation. Although an application access trace contains a list of application concurrent accesses, we schedule each of those separately. For every application concurrent access, the universe U is formed by all its accesses. From the PM predefined parallel access patterns, we define S as the collection of all possible parallel accesses in the PM (see Algorithm 1). Finally, the solution obtained using an ILP solver, Smin ⊆ S, is a list of sets which optimally covers the concurrent access and is converted back into a sequence of parallel-memory accesses. Algorithm 1 shows how to generate S, from which the minimal coverage will be extracted. Set P contains the list of PM conflict-free access patterns, and it is obtained from the PM configuration. Set A contains the coordinates of the application data. Each pair of an application element and an access pattern (ie, elements from A and P, respectively) is resolved into a set of coordinates of application elements, pa, by resolve_pattern (see Section 2.1). To map our problem to the ILP formulation above, we need to guarantee that the union of the collection of subsets in S is equal to the universe U. This is done by removing from the parallel access pa the elements that are not being accessed in the concurrent access, ie, the elements in pa but not in U. The elements of S will be all these Spa sets, for which it holds that ∪Spa∈S Spa = U.
To solve our original problem, we are interested in finding the minimum collection of sets Smin such that ∪S∈Smin S = U and Smin ⊆ S, so the cost function will be defined as c(Spa) = 1, ∀Spa ∈ S. Once U, S, and c are defined, an ILP solver can be used to compute Smin, the minimum collection of sets that covers the universe U.
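For very small instances, the unit-cost set-covering optimization can be illustrated with an exhaustive search; the sketch below stands in for the ILP solver used in our framework (which it is not) and only demonstrates the formulation. The helper name and the toy coordinates are our own:

```python
from itertools import combinations

def min_set_cover(universe, subsets):
    """Exhaustively find a minimum collection of subsets covering `universe`.

    Equivalent in result to the unit-cost ILP (c(S_pa) = 1 for every set),
    but only viable for tiny per-concurrent-access instances.
    """
    universe = frozenset(universe)
    for k in range(1, len(subsets) + 1):
        for combo in combinations(subsets, k):
            if frozenset().union(*combo) >= universe:
                return list(combo)  # first hit at size k is a minimum cover
    return None  # the universe is not coverable by the given subsets

# Toy example: a 1x4 concurrent access covered by 2-wide parallel accesses.
concurrent = {(0, 0), (0, 1), (0, 2), (0, 3)}
parallel = [frozenset({(0, 0), (0, 1)}), frozenset({(0, 2), (0, 3)}),
            frozenset({(0, 1), (0, 2)})]
cover = min_set_cover(concurrent, parallel)  # two parallel accesses suffice
```

In the real flow, each frozenset would be one conflict-free parallel access produced by Algorithm 1, and the returned collection is the per-concurrent-access schedule.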

A heuristic approach
Because the ILP approach is, speed-wise, a major bottleneck in our framework,* we have designed and implemented an alternative, heuristic approach, which trades off optimality for speed. Our heuristic is based on the work of Vazirani, 4 and its solution is guaranteed to be within a harmonic factor of the optimal solution (extracted with the ILP approach).
The Greedy algorithm we have used to implement the heuristic solution is presented in Algorithm 2. E is a set used to keep track of the elements still to be covered with a parallel access, and it is initialized with U, the set containing all the elements in the concurrent access. S contains all parallel accesses for a given PM configuration (Algorithm 1, Section 3.1). In each iteration, the parallel access Spa ∈ S which contains the maximum number of elements that still need to be covered is added to the solution, and the elements covered by Spa are removed from E.
Once all the elements in the application concurrent access have been covered, the algorithm returns the set of parallel accesses Sh containing the solution.
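A minimal Python sketch of this Greedy procedure, operating on plain coordinate sets rather than the PM access objects of Algorithm 2:

```python
def greedy_schedule(concurrent_access, parallel_accesses):
    """Greedy set-cover heuristic (sketch in the spirit of Algorithm 2).

    Repeatedly picks the parallel access covering the most not-yet-covered
    elements; the result is within a harmonic factor of the ILP optimum.
    """
    to_cover = set(concurrent_access)  # E: elements still to be covered
    schedule = []
    while to_cover:
        # Pick the parallel access with maximum overlap with E.
        best = max(parallel_accesses, key=lambda pa: len(pa & to_cover))
        if not best & to_cover:
            raise ValueError("concurrent access not coverable")
        schedule.append(best)
        to_cover -= best
    return schedule

concurrent = {(0, 0), (0, 1), (0, 2), (0, 3)}
parallel = [frozenset({(0, 0), (0, 1)}), frozenset({(0, 2), (0, 3)}),
            frozenset({(0, 1), (0, 2)})]
plan = greedy_schedule(concurrent, parallel)  # two accesses cover everything
```

Each iteration is a linear scan over the candidate accesses, which is what makes the heuristic so much faster than the ILP for large traces.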
*For very small problem sizes, the runtimes of our Heuristic and ILP approaches are comparable. However, even for a problem size as small as a 32 × 32 matrix, the Heuristic becomes 26× faster than ILP, making the use of the ILP prohibitive.

FIGURE 2
An overview of our complete approach, from application to PM

The complete approach
Our complete approach is presented in Figure 2. We start from the Application Access Trace, a description of the concurrent accesses in the application, discussed in detail in Section 2.2. The Application Access Trace can be collected from a given source code by performing static or polyhedral analysis. 5,6 We test different parallel-memory configurations by providing different Configuration Files to our Memory Simulator. † Each Configuration File contains details regarding the mapping scheme, the number of parallel lanes, and the capacity of the parallel memory. The Memory Simulator produces all the available parallel accesses, compatible with the given parallel-memory Configuration File, that cover elements contained in the Application Access Trace. The set of parallel accesses is then given as input to our ILP or Heuristic solver, implemented as described in Sections 3.1 and 3.2. The Solver selects the minimum number of parallel accesses that fully cover the elements in the Application Access Trace, thus producing a Schedule of parallel-memory accesses. The Schedule can then directly be used in the hardware implementation of the application parallel memory.
An important side-effect of our approach is that the information contained in the schedule can further be used to accurately estimate the performance of the generated memory system. Thus, to calculate the achievable average bandwidth of the memory system for the given access trace, we can ''penalize'' the theoretical bandwidth (ie, assuming that all lanes are fully used) by our efficiency metric: Bandwidth = Frequency × Bitwidth × Lanes × (N_seq / N_elements). Frequency is the frequency the PM is operating at, Bitwidth is the size of each element stored in the PM, and Lanes represents the number of elements that can be accessed in parallel; N_seq is the number of required sequential accesses and N_elements is the total number of elements accessed by the PM using a Schedule.
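The model reduces to a one-line computation; the sketch below uses hypothetical numbers (100 MHz clock, 64-bit elements, 8 lanes) purely for illustration:

```python
def predicted_bandwidth(frequency_hz, bitwidth_bits, lanes, n_seq, n_elements):
    """Predicted average bandwidth of the scheduled PM, in bytes/second.

    The theoretical bandwidth (all lanes fully used) is penalized by the
    schedule efficiency, N_seq / N_elements.
    """
    theoretical = frequency_hz * (bitwidth_bits / 8) * lanes
    efficiency = n_seq / n_elements
    return theoretical * efficiency

# Hypothetical case: a schedule that touches 1280 elements in total to
# deliver the 1024 elements the application actually requested.
bw = predicted_bandwidth(100e6, 64, 8, n_seq=1024, n_elements=1280)
```

When every lane slot carries a useful element (n_seq == n_elements), the prediction collapses to the theoretical peak, as expected.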

EVALUATION
This section describes a statistical analysis, based on simulation results, of the potential benefits of PMs for different types of applications, characterized by their memory access patterns. It further compares the solutions obtained by our heuristic against the optimal solutions produced by the ILP algorithm (see Section 3).

Experiment setup
To empirically demonstrate the potential of parallel memories to improve bandwidth and, ultimately, provide speedup over non-parallel solutions even for non-dense memory access patterns, we propose an experiment where we test the PM for a large number of synthetic memory access patterns. We assume that the capacity of the PM is sufficient to contain the application data. For each pattern, we measure both the performance gain and the efficiency of using PMs. This experiment also enables us to compare the two algorithms for scheduling a memory access trace (see Section 3).

Synthetic application concurrent accesses
The set of concurrent accesses (strided) is generated assuming an 8 × 8 data structure and using three parameters: offset, number of reads, and number of skips. The pattern is generated by alternating series of reads and series of skips. The offset defines the number of elements to skip from the element [0][0]. The entire set of synthetic concurrent accesses has been generated using 8-by-8 patterns with the offset varying from 0 to 7, the number of reads varying from 1 to 8, and the number of skips from 1 to 8. This resulted in a total of 512 application access traces.
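The generation of one such pattern can be sketched as follows; the function name and the example parameter values are ours, chosen only to illustrate the read/skip alternation:

```python
def strided_pattern(offset, reads, skips, rows=8, cols=8):
    """Generate one synthetic strided concurrent access over a rows x cols array.

    Starting `offset` cells after element [0][0] (in row-major order), the
    pattern alternates `reads` consecutive reads with `skips` skipped cells.
    """
    coords, pos = [], offset
    while pos < rows * cols:
        for _ in range(reads):
            if pos >= rows * cols:
                break
            coords.append((pos // cols, pos % cols))  # back to 2D coordinates
            pos += 1
        pos += skips
    return coords

# offset=1, reads=2, skips=6 selects cells 1,2, 9,10, 17,18, ... of the 8x8 grid,
# ie, two elements at the start of every row, shifted by one column.
pattern = strided_pattern(offset=1, reads=2, skips=6)
```

Sweeping the three parameters over their ranges (8 × 8 × 8) reproduces the 512 traces used in the experiment.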

PM configurations
The Memory Mapping Functions (MMFs) used in the PM configurations guarantee conflict-free access to the following 2D patterns (see Figure 1): Rectangle, Diagonal, Secondary Diagonal, Row, Column, and Transposed Rectangle. We assumed a memory capacity sufficient to store all application data and experimented with a PM width, M, from 2 to 8 (M = 8 is sufficient to allow full rows/columns/diagonals to be read from our synthetic concurrent accesses), and all combinations of the PRF access patterns. In total, we tested 448 different PM configurations.

Evaluation metrics
We introduce two metrics to evaluate how an application benefits from a parallel memory: speedup and efficiency.
Speedup is a measure of the performance gain from using a custom parallel memory, defined as Speedup = N_seq / N_par. N_seq refers to the number of accesses required using a sequential memory, ie, equal to the number of elements in the application concurrent access; N_par is the number of parallel-memory accesses, obtained using the algorithms in Section 3.
Efficiency is a measure of the ''wasted accesses'' when using a custom parallel memory, defined as Efficiency = N_seq / N_elem. N_elem is the total number of elements accessed by the PM and it is equal to N_par × M, where M is the width of the PM. We note that efficiency is an indirect measure of the overhead of a parallel memory for a sparse access and can be correlated with the power efficiency of the memory system.
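Both metrics reduce to simple ratios over the schedule counters; a small sketch with hypothetical counts:

```python
def speedup(n_seq, n_par):
    """Speedup over a sequential memory: N_seq single accesses vs N_par parallel ones."""
    return n_seq / n_par

def efficiency(n_seq, n_par, m):
    """Fraction of useful elements among the N_par * M elements touched by the PM."""
    return n_seq / (n_par * m)

# Hypothetical schedule: 24 requested elements served in 4 parallel accesses
# on an 8-wide PM.
s = speedup(24, 4)        # 6x fewer memory cycles than a sequential memory
e = efficiency(24, 4, 8)  # 0.75: a quarter of the lane slots carried no useful data
```

A fully dense schedule on an M-wide PM reaches speedup M and efficiency 1; sparsity pushes both down, which is exactly what the violin plots in Figure 3 quantify.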

Results
We have scheduled all 512 synthetic concurrent accesses (Section 4.1) on all 448 memory configurations (Section 4.1) using both the algorithms proposed in Section 3, ie, ILP and heuristic. To determine whether the custom parallel memories are successful in improving the performance of different applications, we analyze speedup; to determine whether the heuristic algorithm can be used as a replacement of the ILP-based solution, we analyze the observed trade-off between the optimality (by the ILP method) and speed (by the heuristic method). Figure 3A shows the speedup results, grouped per PM-width. We make the following observations.

Speedup
• The bottom parts of the plots, indicating low speedups, are very narrow, showing that only very few concurrent accesses did not benefit from using the PM. This is correlated with the sparsity of the memory accesses in the concurrent access and with the fact that the parallel access patterns we used only allow dense parallel accesses.
• The top parts of the violins, corresponding to high speedups, are also narrow, indicating that only few concurrent accesses can gain maximum speedup. Moreover, the figure also shows that, for odd numbers of memories (3,5,7), the occurrence of close-to-maximum and maximum speedup is very rare: in fact, 1-5 patterns, at most, reach the maximum.
• The majority of the concurrent accesses lies in between those two extremes, showing that they gain significant speedup by using the PM, but that it is not possible to fully utilize all the memory banks.
We note that our efficiency results (not included due to space limitations) show a similar picture: few (concurrent access, PM configuration) pairs gain maximum or minimum efficiency, while the average efficiency varies between 0.8 for 2 memories and 0.58 for 8 memories. To see in how many cases the difference between the ILP and heuristic approaches is significant, we compute the ratio

Speedup_heu / Speedup_ILP
and plot the density distribution of this ratio in Figure 3C. In the large majority of cases, the speedups are similar, with a loss of less than 15%; the worst result obtained by the heuristic is 53% of the optimal speedup, for one single configuration. These results indicate that the heuristic algorithm is an acceptable replacement for the ILP when quick estimation is required.

DESIGN AND IMPLEMENTATION
In this section, we briefly present our approach for designing PolyMem to fit a given application and further dive into the implementation of MAX-PolyMem, the specific Maxeler-based version of our memory system. This implementation is open source and is available online at github.com/giuliostramondo/RAW2018. 7

PolyMem
Our parallel memory is based on PolyMem, a design inspired by the Polymorphic Register File. 8 PolyMem is a non-redundant parallel memory, using multiple lanes to enable parallel data access to bi-dimensional data structures, and a specialized hardware module that enables parallelism for multiple access shapes. PolyMem supports several multi-view access schemes, each guaranteeing conflict-free parallel access to a combination of the shapes in Figure 1:
• ReO: Rectangle Only.
• ReRo: Rectangle and Row.
• ReCo: Rectangle and Column.
• RoCo: Row and Column.
• ReTr: Rectangle and Transposed Rectangle.

MAX-PolyMem design and implementation
The hardware implementation and performance analysis presented in this work are all based on the Maxeler version of PolyMem, MAX-PolyMem. 2 Figure 4 shows a diagram describing MAX-PolyMem, our MaxJ PolyMem implementation; we further refer to blocks in this figure in bold and to signals with a spaced-out font. Because PolyMem behaves as a 2D memory, parallel application accesses are made using two coordinates, (i,j), and the shape of a parallel access, AccType. DataIn and DataOut represent the data which is written to and read from MAX-PolyMem. The core of MAX-PolyMem's design consists of a 2D array of memories (p × q BRAMs), where each bank is identified using two coordinates.
These are used to store the data in a distributed manner. In Figure 4, eight such memories are illustrated (M0-M7); these are the Memory Banks, also called memory modules. The number of banks defines the number of data elements which are read/written in parallel per data port, referred to as lanes.
Based on the (i,j) coordinates and the requested access type AccType, the AGU expands the parallel access into its individual components by computing the coordinates of all the accessed elements (p × q addresses in total). This operation is performed for the write port and for each read port, so that one write access and one read access per read port can happen independently at the same time. We note that our design is implemented using two types of Shuffles. Given a reordering signal, the regular Shuffle reorders the elements, while the Inverse Shuffle, with the same reordering signal, restores the initial order. In this design, therefore, the Write Data Shuffle is implemented using an Inverse Shuffle, while the Read Data Shuffle is implemented using a regular Shuffle.
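The defining property of this pair, that the Inverse Shuffle undoes the regular Shuffle under the same reordering signal, can be illustrated with a small software sketch (the hardware, of course, permutes lanes rather than Python lists):

```python
def shuffle(data, order):
    """Regular Shuffle: place data[order[k]] at output position k."""
    return [data[order[k]] for k in range(len(data))]

def inverse_shuffle(data, order):
    """Inverse Shuffle: with the same reordering signal, undo shuffle()."""
    out = [None] * len(data)
    for k in range(len(data)):
        out[order[k]] = data[k]
    return out

lanes = ['a', 'b', 'c', 'd']
order = [2, 0, 3, 1]  # hypothetical reordering signal from the AGU
# Applying both with the same signal restores the original lane order.
assert inverse_shuffle(shuffle(lanes, order), order) == lanes
```

This is why a single reordering signal, computed once per access, can drive both the write path (scattering lanes into banks) and the read path (gathering banks back into lane order).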

MAX-PolyMem performance Experimental setup
To analyze the performance of MAX-PolyMem, we propose an extensive experimental setup in which we vary the capacity, the number of memory lanes, the schemes, and the number of read ports of the parallel memory (see Table 1). We focus our evaluation on performance; thus, memory bandwidth is our metric of choice. For all experiments in this paper, we use a Maxeler Vectis board that uses a Xilinx Virtex-6 SX475T FPGA ‡ featuring 475 k logic cells and 4 MB of on-chip BRAMs. All our experiments configure PolyMem for a data width of 64 bits. Our design is easily configurable: a simple configuration file sets, at compile time, the required DSE parameters. We collected information regarding the FPGA resource usage and the clock frequency for each configuration. We have further computed the peak read and write bandwidth that can be achieved; this peak performance has been empirically confirmed by actual measurements with the resulting FPGA design.

Performance results
In its role as a parallel memory, the most important performance metric for MAX-PolyMem is memory bandwidth. We compute the maximum bandwidth assuming all accesses use the full width of the memory. The main parameters influencing the bandwidth are the design clock frequency, which varies depending on the MAX-PolyMem parameters (see Table 2), the number of lanes, and the number of read ports. As expected, increasing the number of lanes and the number of read ports significantly increases the bandwidth. We also note that the bandwidth is reduced if the number of lanes and ports is kept constant, but the capacity of PolyMem is increased. This is most likely due to the additional pressure put on the synthesis tools to place and route all the additional BRAMs.
Please note that, for the applications that utilize the read and write ports simultaneously, the total delivered PolyMem data rate is the sum of the bandwidth delivered by all individual read and write ports.
Our results have led to two main observations. 2

AN EXTENSIVE CASE-STUDY: SPARSE STREAM
We evaluate the feasibility and performance of our complete approach by designing and implementing 19 parallel-memory accelerators on our FPGA-based system (Maxeler Vectis). Our 19 accelerators belong to a suite we call Sparse STREAM, which is a sparse version of the well-known STREAM benchmark. 9 Our evaluation focuses on the memory bandwidth achieved by our parallel memory.

Sparse STREAM
To prove the feasibility of our approach, from application access traces to hardware, we adapt the STREAM benchmark, 9,10 a well-known tool for memory bandwidth estimation in modern computing systems, to support sparse accesses.
The original STREAM benchmark uses three dense vectors-A, B, and C-and proposes four kernels: Copy (C = A), Scale (A = q·B), Sum (A = B+C), and Triad (A = B + q · C). However, the original STREAM does not challenge our approach because it uses dense, regular accesses. We therefore propose Sparse STREAM, an adaptation of STREAM which allows 2D arrays and configurable sparse accesses. Table 3 presents 19 possible variants of Sparse STREAM, labeled based on their read access density. The main difference between these variants is the number of sequential accesses, N seq . We also distinguish between regularly strided patterns, generated as explained in Section 4.1, and random patterns generated using a pseudo-random function which can guarantee an average access density of the pattern.
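As an illustration of what a Sparse STREAM kernel computes, a sparse Triad step could be sketched as below. This is a hypothetical software model, not the MaxJ implementation; `access` stands for the set of (i, j) coordinates produced by one Sparse STREAM variant:

```python
def sparse_triad(A, B, C, q, access):
    """Sparse Triad (A = B + q*C), restricted to the given coordinates."""
    for (i, j) in access:
        A[i][j] = B[i][j] + q * C[i][j]

rows, cols = 4, 4
B = [[1.0] * cols for _ in range(rows)]
C = [[2.0] * cols for _ in range(rows)]
A = [[0.0] * cols for _ in range(rows)]
# Only 2 of the 16 cells are touched, mimicking a low-density variant.
sparse_triad(A, B, C, q=3.0, access=[(0, 0), (1, 2)])
```

The denser the `access` set, the closer the kernel gets to the original (dense) STREAM Triad, which is exactly the sparsity knob the 19 variants turn.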
We apply our methodology for each variant. Thus, for each variant, we obtain the (close-to-) optimal schedule per access scheme. The schedule is characterized by the number of parallel accesses N par , and the total number of accessed elements N elements (Section 3), from which we calculate speedup and efficiency per access scheme. Because regular memory systems often optimize row-wise accesses, we define a Row Only scheme as our baseline. We further present results for two PolyMem schemes (namely, ReRo and RoCo) in Table 3. We select the best performing PolyMem scheme to test in hardware.
The final step in our approach is implementing the schedule in the hardware of our parallel-memory accelerator.

FIGURE 6
The implementation of the STREAM benchmark for MAX-PolyMem (figure updated from our previous work 2 ). All transfers between the host (the CPU) and PolyMem (on the FPGA) are done via the PCIe link

Sparse STREAM in hardware
We have designed a flexible template for implementing STREAM using MAX-PolyMem. 2 § A high-level view of our design is presented in Figure 6.
We use this design as a template for our Sparse STREAM accelerators. Thus, the remaining challenge is to enable the controller to orchestrate the parallel-memory operations based on the calculated schedule. Our current prototype stores the schedule, which contains information regarding the required sequence of parallel accesses (coordinates, shape, and mask), in an on-chip Schedule memory. The controller reads, in every clock cycle, one entry from the schedule (coordinates, shape, and mask), and executes the required parallel-memory access. The host can (dynamically) load a schedule in this memory, as soon as such a schedule is available.

Performance results
We have implemented all 19 STREAM variants in hardware by configuring MAX-PolyMem, for each test-case, with a memory of 4 MB containing 261120 elements (ie, the maximum capacity available fitting the arrays A, B, C and the schedule memory), and the best scheme (see Table 3). We measure the bandwidth of our 19 Sparse STREAM kernels (average over 10000 runs). ¶ The results-predicted vs measured-are presented in Figure 7. We make the following observations.
• Our performance model (see Section 3) accurately predicts the performance of the memory system (below 1% error in most cases).
• For 6 out of the 9 regular sparse STREAM variants, we can achieve close to optimal speedup due to our parallel memory being multi-view and polymorphic.
• For almost all of the random sparse STREAM variants, except for the 60%, 80%, and 100% density ones, the added flexibility given by the RoCo and ReRo schemes results in a performance increase over the Row Only baseline.
• In the case of the random sparse STREAM variants, the performance gap between alternative sets of shapes is reduced. This is due to the randomness of the accesses, which prevents the generation of repeated structures, hence averaging out the benefits of using one scheme over another.
• Our STREAM PolyMem design uses only 25.98% of the logic available on the Vectis Maxeler board, leaving over 74% available for increasing the complexity of the application kernel or performing, concurrently, a different task. More information regarding the resource usage is available in our previous work. 2

Overall, our experiments are successful: we demonstrated that the schedule generated by our approach can be used in real hardware, and we showed that the measured performance is practically the same as the predicted one.

RELATED WORK
To the best of our knowledge, the framework presented in this work is the first one that deals with configuration and usage of parallel memories for applications performing sparse concurrent accesses. However, we can identify three research areas related to our work: application access pattern analysis, the design of custom parallel memories, and the generation of memory systems on FPGAs.
Memory access patterns are extremely relevant for overall application performance, and they have been extensively studied in the past. For example, in polyhedral optimization, information regarding the application access patterns is used to reorder the computation and increase the application data locality. Many automated tools are available for this purpose, with Polly, 5 Pluto, 12 and Graphite 6 among the most popular.
However, these techniques are specifically designed and most often used to improve the software to match the memory architecture of a given system. We approach the same problem from a different angle, as we aim to optimize the memory system to match the computation.
Parallel memories designed to improve system memory bandwidth have been proposed in research since the 1970s and remain of interest today. Parallel memories which use a set of predefined memory mapping functions to enable parallel accesses in a set of predefined shapes [13][14][15] have improved to better support more shapes, multiple views, and polymorphic access. 8 Similarly, approaches which derive an application-specific mapping function 16,17 have also recently emerged, constantly improving the efficiency and performance of the generated memory systems. The framework presented in this work gives the ability to compare these different memory mapping functions, predict their performance, and optimize their use for applications performing sparse accesses. The current version uses a polymorphic parallel memory with fixed shapes, to which we add the novel analysis and configuration methodology, and an extensive evaluation. In the near future, we plan to explore a second back-end, based on generating application-specific mapping functions, enabling an even finer-grained customization of the memory system.
Finally, with the huge increase in popularity of FPGAs through HLS, a lot of research has been invested in building application-specific caches for FPGAs. Although successful, such research [18][19][20] does not (yet) address parallel and/or polymorphic memories. Our work is similar in its goal, ie, to make more efficient use of the FPGA BRAM memory by providing a more productive user interface, but we propose a solution based on a 2D scratch-pad memory, with automated, application-specific read/write operations.
In summary, our approach draws ideas from these topics, and it builds upon our previous work, 2,3 but it is the first work to describe in detail, evaluate, implement, and benchmark a complete methodology, from application to hardware, to build application-centric parallel memories on FPGAs.

CONCLUSION AND FUTURE WORK
Modern heterogeneous systems currently feature accelerators that offer massive parallelism for compute-intensive applications, but often suffer from memory bandwidth limitations. In this work, we have proposed a solution to tackle these limitations by instantiating and using parallel-memory accelerators. By basing our approach on a highly configurable parallel-memory system, we are able to instantiate application-specific accelerators, which provide both high bandwidth and high efficiency for the specific memory access patterns of the given application.
Our methodology is an application-to-accelerator workflow, which performs the following actions: analyzes the application access trace, configures and builds a custom non-redundant parallel memory (eg, PolyMem), optimized for the kernel of interest, generates the parallel-memory accelerator in hardware, and embeds it in the original host code.
We validated our methodology by analyzing a large set of memory access patterns, with different sparsity levels. Our results demonstrate that parallel memories are not only useful for dense accesses, but that the true gain for PolyMem (due to its multi-view, polymorphic properties) is for sparser accesses, where we can still gain significant speedup at high efficiency. Overall, the results over 500 patterns show at least 10% gain for more than 90% of the cases. Given that the entire methodology is automated, this gain is virtually free of any intervention from the programmer.
We have further proven the feasibility of the approach by generating the hardware accelerators for 19 different instances of Sparse STREAM.
We demonstrated that we can instantiate and benchmark all 19 designs in real hardware (ie, a Maxeler FPGA system and the MAX-PolyMem implementation of PolyMem), and our experimental results demonstrate clear bandwidth gains, which closely match our model's predictions.
Our ongoing work focuses on the analysis of more applications. In the near future, we aim to improve and automate the access trace extraction, to design more efficient ways to integrate the parallel-memory accelerator into the host application, and to extend the model toward accurate full-application performance prediction.