Compressed Neighbour Lists for SPH

We propose a novel compression scheme to store neighbour lists for iterative solvers that employ Smoothed Particle Hydrodynamics (SPH). The compression scheme is inspired by Stream VByte, but uses a non‐linear mapping from data to data bytes, yielding memory savings of up to 87%. It is part of a novel variant of the Cell‐Linked‐List (CLL) concept that is inspired by compact hashing with an improved processing of the cell‐particle relations. We show that the resulting neighbour search outperforms compact hashing in terms of speed and memory consumption. Divergence‐Free SPH (DFSPH) scenarios with up to 1.3 billion SPH particles can be processed on a 24‐core PC using 172 GB of memory. Scenes with more than 7 billion SPH particles can be processed in a Message Passing Interface (MPI) environment with 112 cores and 880 GB of RAM. The neighbour search is also useful for interactive applications. A DFSPH simulation step for up to 0.2 million particles can be computed in less than 40 ms on a 12‐core PC.


Introduction
Approaches for the Smoothed Particle Hydrodynamics (SPH) neighbour search rarely discuss the storage of the computed neighbour lists. This is because storing neighbour lists was not necessarily required in state-equation solvers with very few iterations over particle neighbours per simulation step. Also, storing neighbours can constitute a significant memory overhead which is particularly prohibitive for GPU implementations.
Current SPH approaches, however, employ an increasing number of neighbour iterations. State-of-the-art incompressibility solvers such as Predictive-Corrective Incompressible SPH (PCISPH) [SP09], Implicit Incompressible SPH (IISPH) [ICS*14], Divergence-Free SPH (DFSPH) [BK17] and Position Based Fluids (PBF) [MM13] are iterative. Boundary handling can be iterative as proposed in, e.g. Pressure Boundaries [BGI*18] or for strong solid-fluid coupling [GPB*19]. There also exists a growing number of implicit formulations for, e.g. viscous [PT17, WKBB18], elastic [PGBT18] or ferromagnetic materials [HHM19] that employ iterative solvers. In such iterative solvers, storing and reusing neighbour lists is a natural choice. This paper focuses on the storage of compressed neighbour lists in iterative SPH solvers. We propose a novel compressed representation of neighbour lists that is inspired by Stream VByte [LKR17]. Furthermore, the compression scheme is embedded into a novel variant of the Cell-Linked-List (CLL) approach, e.g. [HGE74, DCGGM11, IABT11]. We show that the proposed neighbour search and query outperform state-of-the-art methods such as compact hashing [IABT11] in terms of memory consumption and computing time. Using our neighbour processing, fluid scenes with up to 200 million SPH particles can be computed on a single PC with 32 GB memory.
The proposed compression scheme works on a particle list that is sorted with respect to the index of a space filling curve computed for a virtual uniform grid. Figure 1 illustrates the setting and Figure 2 indicates the sorted particle list where particles in the same space cell are adjacent. This array is queried to find the neighbour list for each particle. Figure 3 shows the same array with highlighted particles that constitute the neighbour list of an exemplary particle; the numbers indicate index differences in the sorted array. The fact that all neighbour lists are represented by a smaller number of clusters in the sorted array is the starting point for the proposed compression. We propose a novel compression scheme to store the index differences.

Our contribution
• We propose to store compressed neighbour lists in SPH. We show that this concept enables fluid scenarios with 1.3 billion particles on a single PC and more than 7 billion particles on six connected PCs.
• We propose a novel scheme for the compressed storage of integer-valued index differences. In contrast to, e.g. Stream VByte [LKR17], small values are directly encoded in the control bits without using any data byte. On the other hand, values of intermediate size (encoded with two or three data bytes in Stream VByte) require four bytes in our approach. We show that the proposed non-linear mapping of values to data bytes outperforms Stream VByte in SPH scenarios due to the clustered representation of neighbours in the sorted particle list.
• Our compressed representation is integrated into a novel CLL approach. We propose a novel compact and memory-efficient representation of the cell-particle relation in combination with an adapted neighbour query using the BigMin-LitMax algorithm [TH81]. In contrast to [DCGG13], infinite domains can be handled by not representing empty cells. In contrast to compact hashing [IABT11], only one particle reference is stored per used cell. We show that our CLL approach is faster than compact hashing. In particular, 0.2 million SPH particles can be interactively simulated on a 12-core PC with our approach.
• We present various analyses. We show performance comparisons to compact hashing and memory comparisons to Stream VByte. We show that differences in computing times are negligible when using compressed instead of uncompressed neighbour lists. We further show that compressed neighbour lists are particularly advantageous for larger kernel supports with an increasing number of neighbours where memory savings of up to 87% can be obtained. Finally, we illustrate the parallelizability and scalability of the neighbour search and storage by presenting scenarios on a 12- and 24-core PC and in a Message Passing Interface (MPI) environment with 112 cores.

Compression
Our work focuses on integer-valued compression techniques. In the sorted array of a CLL approach, SPH particles are identified by indices represented with 32-bit unsigned integers. Also, index differences are integer values. There exists a large variety of suitable compression methods, and recent surveys can be found, e.g. in [ZZL*15] and [LB15].
We consider methods that encode a block of integer values with control and data bytes. In binary packing, a fixed number of data bits is used for each value within the same block of, e.g. 128 integers [NAM10]. The decreasing efficiency of this fixed pattern in case of outliers in a block has been addressed, e.g. by the patched frame of reference (PFOR) [ZHNB06]. Recent PFOR implementations identify outliers in a sequence and process them separately, e.g. NewPFD [YDS09], OptPFD [YDS09] and FastPFOR [LB15].
As the SPH neighbour lists are represented in a smaller number of clusters in the sorted array of our CLL approach, we prefer compression schemes that handle small index differences within a cluster and few large index differences between clusters with an adaptive number of data bits. For example, VByte [TH72] is a popular choice to compress sequences of integers with variable data patterns, e.g. [PKL15]. VByte is byte-oriented, but its implementation involves a branch and potential branch mispredictions may impair the performance.
VarIntGB [Dea09] alleviates the performance problems of VByte. It is also well suited to make use of modern hardware acceleration Single Instruction Multiple Data (SIMD) [SGR*11, LB15, ZZL*15]. However, in VarIntGB there is a strong data dependency in the decoding procedure, as the location of the next control byte depends on the current control byte. This increases the risk that the processor remains underutilized, as it cannot load and start processing upcoming control bytes.
Stream VByte [LKR17] overcomes this data dependency by storing the control bytes in a stream that is separate from the data bytes.
In their experiments, [LKR17] reported Stream VByte to be the fastest of the above-mentioned algorithms. Therefore, we base our compression scheme on Stream VByte. In contrast to Stream VByte, however, we propose a non-linear mapping from integer values to the number of data bytes. In particular, the values one and two are represented in the control byte without the usage of any data byte. We show that this variant outperforms Stream VByte in the specific context of the CLL-based neighbour search in SPH simulations.

Neighbour search
SPH requires information of neighbouring particles to compute the particle interactions. A naive approach in which each particle is a potential neighbour of all other particles is infeasible for a large number of particles N due to a computational complexity of O(N²). It is also unnecessary, since particle pairs that are farther apart than the SPH support length have a kernel function value of zero and can thus be discarded.
To accelerate the neighbourhood query, hierarchical tree data structures that can be built in O(N log N) and queried in O(log N) have been applied, e.g. [WS95, KAG*05, HCM06, APKG07], however, mostly in astrophysical problems with long-range interactions [Spr05, GR11, HBMW11]. In contrast, researchers in computer graphics seem to prefer uniform grids. In addition to uniform grids, Verlet-Lists (VL) [Ver67, VBC08, WMRR17] have been proposed to alleviate the performance issue of recomputing the neighbour lists in each simulation step. As the neighbour lists are created from a larger support, they can be retained for several steps. Due to memory restrictions on low memory systems like GPUs, using the VL approach might not always be an option. In this case, a CLL approach is preferred, e.g. [HGE74, HKK07, GSSP10, DCGG13].
CLL approaches require the construction of a data structure to link a cell of the underlying uniform grid with a list of particles. In GPU implementations, e.g. [Gre10, DCVB*13, GEF15, MRSD15, WRR18], all grid cells are typically represented, including empty ones. For sparsely filled domains, where the number of grid cells is much larger than the number of particles, this approach is not memory-efficient. In order to represent only non-empty cells, [Gre10, IABT11, TLTM18] propose to use an approach based on spatial hashing. However, hash tables are designed to scatter data, even for spatially close cells, in order to minimize hash collisions. This leads to low cache-hit rates for the insertion and query. Furthermore, race conditions while inserting, e.g. due to hash collisions, have to be handled appropriately.
In contrast to previous works, we address the neighbour search problem by a novel compact and memory-efficient representation of the CLL in combination with an adapted neighbour query. Thereby, we avoid the issues associated with hashing while infinite domains can be handled with low memory consumption.

Method
This section describes the concept of our novel CLL variant. We start with the proposed compression scheme for neighbour lists in Section 3.1. Afterwards, the embedding of the compressed neighbour lists into the novel CLL variant is explained in Section 3.2. Implementation details are given in Section 4.

Compression
We want to compress neighbour lists that are represented in the sorted array of our CLL approach (see Figure 3). If n_i is a particle index in the sorted array and a neighbour list consists of N particles, we are interested in a lossless compression of N strictly increasing non-negative integers n_1, n_2, …, n_N. As motivated in Section 2, our compression scheme is inspired by Stream VByte [LKR17]. Stream VByte encodes the number of non-zero data bytes required to represent an integer in a 2-bit binary mask (see Table 1). Four of these binary masks build one control byte. As the number N of integers is known, the first 2N/8 bytes in the output sequence are used to store the control bytes. The control bytes are followed by the data bytes which are made up of the non-zero bytes of the input integers. For efficiency, integers are encoded and decoded in blocks of four by utilizing SIMD hardware instructions. To encode such a block, 5-17 bytes are required.
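For illustration (the values are chosen arbitrarily), consider a block containing the four integers 3, 260, 70000 and 2^25. Stream VByte represents them with 1, 2, 3 and 4 data bytes, respectively, i.e. with the 2-bit binary masks 00, 01, 10 and 11 (cf. Table 1). The four masks are packed into a single control byte in the control stream, and the 1 + 2 + 3 + 4 = 10 data bytes of the block are appended to the data stream.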
To minimize the range of the considered data, deltas or gaps with δn_i = n_i − n_{i−1} are useful. In our context, many deltas are small and thus well compressible [PKL15]. To recover the original data from the deltas, the first value n_1 of a sequence has to be stored together with the deltas. Then, a prefix sum n_i = n_{i−1} + δn_i is used to reconstruct the rest of a sequence [LF80]. The value of each recovered integer, however, can only be calculated after the preceding value is known. To alleviate this problem, deltas could be computed on a four-by-four basis, i.e. δn_i = n_i − n_{i−4} as proposed in [LB15]. Another approach is to compute the deltas according to the minimal value, i.e. the first value in our application: δ_i = n_i − n_1. This approach is commonly referred to as frame of reference (FOR) [GRS98]. Although both approaches are faster than the original differential coding, they tend to generate larger deltas and inferior compression ratios.
In our specific application of SPH neighbour lists, the deltas are always greater or equal to one. Subtracting one from each delta makes them even smaller, but still non-negative, i.e. δn_i = n_i − n_{i−1} − 1 ≥ 0. As discussed in Section 5.1 and illustrated in Figure 7 for an exemplary scenario, many deltas are either zero or one. In consequence, we propose to improve the compression rate of Stream VByte by introducing a non-linear mapping from deltas to data bytes. Instead of using 1, 2, 3 or 4 data bytes, we encode the values 0 and 1 with control bits only, and use 1 or 4 data bytes for all other values (see Table 1). No data bytes are required for the values 0 and 1. Values in the interval [2, 2^8) are represented with one data byte. Values larger or equal to 2^8 are represented with four data bytes. The compression algorithm for a list of neighbours works as follows. The first neighbour n_1 is stored uncompressed. Then, deltas are computed with δn_i = n_i − n_{i−1} − 1 for all other neighbours in the list. From the deltas, we gather the control bits as specified in Table 1 and store the final control byte sequence. For deltas not equal to zero or one, we encode δn_i with 1 or 4 data bytes and store the final data byte sequence.
For decoding, these steps are reversed. We read a control byte, extract the control bits, read the specified number of data bytes and compute a prefix sum using the first uncompressed value n 1 to finally restore the original values n 2 , . . . , n N . Figure 4 illustrates an exemplary setting.
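To make the scheme concrete, the following C++ sketch encodes and decodes a neighbour list with the proposed non-linear mapping. This is a minimal sequential sketch, assuming the 2-bit control codes 0 → delta 0, 1 → delta 1, 2 → one data byte and 3 → four data bytes (cf. Table 1) and little-endian data bytes; the block-wise SIMD processing of the actual implementation is omitted, and all names are illustrative.

```cpp
#include <cstdint>
#include <vector>

// Compressed neighbour list: first index uncompressed, then one 2-bit
// control code per delta (four codes per control byte) plus 0, 1 or 4
// data bytes per delta.
struct CompressedList {
    uint32_t first = 0;            // n_1, stored uncompressed
    uint32_t count = 0;            // number of encoded deltas (N - 1)
    std::vector<uint8_t> control;  // four 2-bit codes per byte
    std::vector<uint8_t> data;     // variable-length data bytes
};

CompressedList encode(const std::vector<uint32_t>& neighbours) {
    CompressedList out;                  // assumes a non-empty, sorted list
    out.first = neighbours.front();
    for (size_t i = 1; i < neighbours.size(); ++i) {
        uint32_t delta = neighbours[i] - neighbours[i - 1] - 1;  // >= 0
        uint8_t code;
        if (delta <= 1) {
            code = static_cast<uint8_t>(delta);  // 0 or 1: no data byte
        } else if (delta < 256) {
            code = 2;                            // one data byte
            out.data.push_back(static_cast<uint8_t>(delta));
        } else {
            code = 3;                            // four data bytes
            for (int b = 0; b < 4; ++b)
                out.data.push_back(static_cast<uint8_t>(delta >> (8 * b)));
        }
        if (out.count % 4 == 0) out.control.push_back(0);
        out.control.back() |= static_cast<uint8_t>(code << (2 * (out.count % 4)));
        ++out.count;
    }
    return out;
}

std::vector<uint32_t> decode(const CompressedList& in) {
    std::vector<uint32_t> neighbours{in.first};
    size_t d = 0;  // read position in the data byte stream
    for (uint32_t i = 0; i < in.count; ++i) {
        uint8_t code = (in.control[i / 4] >> (2 * (i % 4))) & 3u;
        uint32_t delta = 0;
        if (code <= 1) {
            delta = code;                        // encoded in control bits
        } else if (code == 2) {
            delta = in.data[d++];                // one data byte
        } else {
            for (int b = 0; b < 4; ++b)          // four data bytes
                delta |= static_cast<uint32_t>(in.data[d++]) << (8 * b);
        }
        neighbours.push_back(neighbours.back() + delta + 1);  // prefix sum
    }
    return neighbours;
}
```

For any strictly increasing input list, decode(encode(list)) reproduces the original list, and deltas of zero or one consume no data bytes at all.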

Neighbour search
Our compression scheme builds on a novel CLL variant for the neighbour search. In particular, the CLL query has to compute neighbours with sorted indices to guarantee non-negative deltas for the compression. Furthermore, our CLL variant is motivated by memory efficiency in support of the compressed neighbour lists.
We consider a 3D grid that consists of equally sized cube cells, e.g. [KS09, Gre10]. A particle with position r_i = (x, y, z) is mapped to the 3D coordinate of a grid cell with (k, l, m) = ⌊(r_i − r_min) / d⌋, where d denotes the edge length of a cell and we subtract the minimum position r_min = (min x, min y, min z) to ensure that all grid coordinates (k, l, m) are positive. The grid is only implicitly represented by the edge length of a cell, i.e. the SPH kernel support in our application. Each 3D coordinate (k, l, m) is then mapped to a unique 1D cell index c ≡ c(k, l, m), e.g. [PDC*03, KS09, IABT11] (see Figure 5a).
In order to improve spatial locality and in turn reduce memory transfer by improving the cache-hit rate, we compute the cell index based on Morton codes [Mor66]. If k, l and m are represented by B bits, i.e. k = k_B ⋯ k_1, l = l_B ⋯ l_1 and m = m_B ⋯ m_1, the cell index c is computed by bit interleaving, i.e. c(k, l, m) = k_B l_B m_B ⋯ k_1 l_1 m_1 [PF01, LSY01]. After that, the particle list is sorted according to this index. Although parallel radix sort is often used for sorting, e.g. [OD08, Gre10, GSSP10], we employ a parallel stable merge sort [MRR12] as it outperforms our radix sort implementation even for a large number of particles. In the sorted array, particles that belong to the same grid cell are stored contiguously in memory (see Figure 5c).
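The bit interleaving is commonly implemented with a few shift-and-mask steps using precomputed magic numbers. The following sketch shows one such widely used 64-bit construction for up to 21 bits per coordinate, combined with the grid coordinate computation from above; the function names and the parameter d for the cell edge length are illustrative.

```cpp
#include <cmath>
#include <cstdint>

// Spreads the lower 21 bits of v so that two zero bits separate
// consecutive bits (standard magic-number bit expansion).
uint64_t expandBits(uint64_t v) {
    v &= 0x1fffff;
    v = (v | v << 32) & 0x1f00000000ffff;
    v = (v | v << 16) & 0x1f0000ff0000ff;
    v = (v | v << 8)  & 0x100f00f00f00f00f;
    v = (v | v << 4)  & 0x10c30c30c30c30c3;
    v = (v | v << 2)  & 0x1249249249249249;
    return v;
}

// Maps a particle position to its Morton cell index: first compute the
// non-negative grid coordinates (k, l, m), then interleave their bits,
// i.e. c(k, l, m) = k_B l_B m_B ... k_1 l_1 m_1.
uint64_t mortonCellIndex(double x, double y, double z,
                         double minX, double minY, double minZ, double d) {
    const uint64_t k = static_cast<uint64_t>(std::floor((x - minX) / d));
    const uint64_t l = static_cast<uint64_t>(std::floor((y - minY) / d));
    const uint64_t m = static_cast<uint64_t>(std::floor((z - minZ) / d));
    return (expandBits(k) << 2) | (expandBits(l) << 1) | expandBits(m);
}
```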
In our novel CLL variant, we propose to build a compact list of all non-empty cells from the sorted particle list. For each non-empty grid cell, we store its cell index and a reference to its first particle in the sorted particle list. This is in contrast to compact hashing [IABT11], where references to all particles in a cell are stored. It is also in contrast to [DCGG13], where all empty and non-empty cells are represented and store references to their first and last particle in the sorted particle list. We can infer the number of particles inside a cell by computing the difference between the particle reference of the current cell and the particle reference of the next grid cell in the compact list. To create the compact cell array, first, a marker is computed from the sorted particle list (see Figure 5d). A particle is marked with one if its cell value differs from the preceding particle in the sorted list, or with zero otherwise. A parallel scan (prefix sum) over the markers provides the offsets for the compacted cell array. Note that the marker and scan values are only computed, but not stored. Finally, the particle index is written to the compact cell array at the scan location if the particle's marker value is one.
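A sequential C++ sketch of this construction is given below, assuming cellIndex holds the Morton index of each particle in sorted order. The paper computes the markers and the prefix sum in parallel and never stores them; the sequential loop below makes the marker logic explicit, and all names are illustrative.

```cpp
#include <cstdint>
#include <vector>

struct CellEntry {
    uint64_t cell;           // cell (Morton) index
    uint32_t firstParticle;  // first particle of this cell in the sorted list
};

// cellIndex[i] is the cell index of particle i in the sorted particle list.
std::vector<CellEntry> buildCompactCellArray(const std::vector<uint64_t>& cellIndex) {
    std::vector<CellEntry> cells;
    for (size_t i = 0; i < cellIndex.size(); ++i) {
        // Marker is 1 for the first particle of each cell, i.e. if the
        // cell index differs from that of the preceding particle.
        if (i == 0 || cellIndex[i] != cellIndex[i - 1])
            cells.push_back({cellIndex[i], static_cast<uint32_t>(i)});
    }
    return cells;
}
// The particle count of cell j follows implicitly as
// cells[j + 1].firstParticle - cells[j].firstParticle
// (with the total particle count closing the last cell).
```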
For the neighbourhood query, particles inside a given bounding box must be found. We define a bounding box as a range between a minimum and a maximum cell index and adopt an idea of [DCGG13]. As the grid cells follow the order of the underlying space filling curve, consecutive cells can be queried together (see Figure 5b). However, not all cells in a given range contain potential particle neighbours. In order to maximize query performance, the range is split into smaller subranges. To compute the subranges, we propose to use the BigMin-LitMax algorithm [TH81, OM84]. Finally, to compute the lower and upper particle index bound for a subrange, we perform a ternary search on the cell array with logarithmic complexity in the number of non-empty grid cells. Due to an ordered processing of the subranges, the indices of the estimated neighbours in a set are ordered, which is a prerequisite for our proposed compressed neighbour lists. This also results in an optimized caching and hardware pre-fetching behaviour for the SPH computations.
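The following sketch illustrates how a cell-index subrange [lo, hi] produced by the BigMin-LitMax splitting (assumed given here) is mapped to particle index bounds on the compact cell array. For brevity it uses std::lower_bound instead of the ternary search with linear fallback described in Section 4, and it reuses the CellEntry layout from the previous sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct CellEntry {
    uint64_t cell;           // Morton index of a non-empty cell
    uint32_t firstParticle;  // first particle of this cell in the sorted list
};

// Particles whose cell indices lie in [lo, hi] occupy [begin, end) of the
// sorted particle list; cells must be sorted by their cell index.
std::pair<uint32_t, uint32_t> particleRange(const std::vector<CellEntry>& cells,
                                            uint64_t lo, uint64_t hi,
                                            uint32_t numParticles) {
    auto less = [](const CellEntry& e, uint64_t c) { return e.cell < c; };
    auto first = std::lower_bound(cells.begin(), cells.end(), lo, less);
    auto last  = std::lower_bound(cells.begin(), cells.end(), hi + 1, less);
    uint32_t begin = (first == cells.end()) ? numParticles : first->firstParticle;
    uint32_t end   = (last  == cells.end()) ? numParticles : last->firstParticle;
    return {begin, end};  // potential neighbours of the subrange
}
```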

Implementation
The implementation of our CLL approach is summarized in Algorithm 1. First, the particle array is sorted with respect to the cell index. Then, marker and scan values as indicated in Figure 5(d) are computed on the particle array to enable the generation of the compact cell array. After that, the BigMin-LitMax algorithm is performed for all cells in the compact cell array. The particles in the resulting subranges of cells are potential neighbours. Found neighbours are compressed and stored per particle. All loops are well suited for parallelization as there are no data dependencies.
Most relevant indices and references are simply computed or directly looked up in the two arrays. The minimum and maximum cell indices computed by the BigMin-LitMax algorithm, however, have to be searched in the compact cell array. Here, we employ a ternary search algorithm with a fallback to linear search [SGL09]. The search interval is recursively cut in three parts until the remaining search interval contains less than 16 elements. For the remaining elements, we fall back to a linear search algorithm. Linear search has a worse theoretical run-time complexity compared to ternary search. In practice, however, it can be more efficient for searching small intervals as it maps well to modern SIMD hardware: it can be implemented in a branch-free way while testing multiple elements at once. For further performance improvements, we cache the last 1024 results of the search algorithm. If a previously processed value is found in the cache, it can be reused without a search.
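A sketch of this search is given below. It returns the position of the first cell entry whose index is not smaller than the searched value (lower-bound semantics, so that cell indices absent from the compact array are handled gracefully); the 16-element threshold follows the text, the 1024-entry result cache is omitted, and all names are illustrative.

```cpp
#include <cstdint>
#include <vector>

// Ternary search with a linear fallback on a sorted array of cell
// indices. Returns the first position i with cells[i] >= target,
// or cells.size() if no such position exists.
size_t lowerBoundCell(const std::vector<uint64_t>& cells, uint64_t target) {
    size_t lo = 0, hi = cells.size();        // result lies in [lo, hi]
    while (hi - lo >= 16) {
        size_t t  = (hi - lo) / 3;
        size_t m1 = lo + t, m2 = hi - 1 - t;  // two probes split [lo, hi)
        if (cells[m2] < target)       lo = m2 + 1;  // result in (m2, hi]
        else if (cells[m1] >= target) hi = m1 + 1;  // result in [lo, m1]
        else { lo = m1 + 1; hi = m2 + 1; }          // result in (m1, m2]
    }
    // Linear fallback over the remaining interval; such a simple loop
    // is easily vectorized by the compiler (several tests per SIMD op).
    while (lo < hi && cells[lo] < target) ++lo;
    return lo;
}
```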

SPH setting:
We have integrated our novel CLL variant and the compressed neighbour representation into a DFSPH framework [BK17, BGPT18]. If not stated otherwise, we use the Wendland C6 kernel [Wen95] with a support of two times the particle spacing h for all SPH interpolations. We follow [BL99, Gan15] and use linear-exact gradient estimates for computing the pressure acceleration and the divergence of the velocity.
Simulation features: In free-surface scenarios, we apply a drag force to the fluid-air interface as described in [GBP*17] and model surface tension as proposed in [AAT13]. Viscosity is modelled as proposed by [MFZ97]. Solid boundaries are represented with a single particle layer of non-uniform size [AIA*12]. Pressure forces at solid boundaries are computed with extrapolated pressure according to [BGPT18].
Hardware and software: The presented scenarios have been computed on three different systems. (1) One workstation with 12 cores and 32 GB of RAM (2.6 GHz Intel Xeon E5-2690).
(2) One workstation with 2 × 12 cores and 256 GB of RAM (2 × 2.7 GHz Intel Xeon E5-2697). (3) Six workstations with 112 cores and 880 GB of RAM (two 2 × 12-core 2.7 GHz Intel Xeon E5-2697 and four 2 × 8-core 3.1 GHz Intel Xeon E5-2687W). The six workstations are connected via Gigabit Ethernet. All computations are fully parallelized with Intel Threading Building Blocks [Phe08]. For the six connected workstations, an MPI environment is realized. Like [TSG14], we use Orthogonal Recursive Bisection (ORB) [Fox88] for spatial domain decomposition to distribute the particles to the computation nodes. The load balancing is done with a proportional integral controller [FE08, OLTG*16]. The ray-traced images are rendered with PreonLab [FIF19]. The accompanying video S1 is generated with 50 frames per second.

Scenarios:
We use a dam break (Figure 6) and a wind tunnel (Figure 10) scenario in our experiments.

Space filling curves
The space filling curve is an important degree-of-freedom in our CLL approach that affects the performance of the index computation and neighbour list creation. It also influences the compression rate of the stored neighbour lists. Therefore, we compare XYZ (also used in [PDC*03, DCGG13]), Morton (also used in [IABT11]) and Hilbert curves (also used in [GRLS18]) with respect to performance and compressibility using the dam break scenario. All considered curves are discussed, e.g. in [Bad12].
In a first experiment, we measure the wall-clock time to compute the curve index, i.e. the cell index for the particles (Lines 1-3 in Algorithm 1). Furthermore, we measure the time required to search and store the (compressed) neighbour lists for the particles (Lines 6-17 in Algorithm 1). This is an indicator of the spatial locality that is obtained by a space filling curve. As spatial locality also influences all SPH computations, we also measure the total computation time required per simulation step including neighbour search and storage, pressure computation and particle advection. The results of our timing experiments are summarized in Table 2.
The measurements in Table 2 indicate that the curves have varying performance in the cell index computation, the neighbour list creation and the SPH computations. For example, the index computation of a Hilbert curve is rather expensive and the neighbour list creation is comparatively slow. Interestingly, the SPH computations seem to benefit from the Hilbert curve, which results in an improved overall performance compared to the XYZ curve. The Morton curve resulted in the best performance for creating the neighbour lists and also in the best overall performance.
In a second experiment, we measure how well the tested space filling curves are suited for the compression of the neighbour lists. After finding all particle neighbours and storing them in ascending sorted order, we compute deltas δn_i = n_i − n_{i−1} − 1 for all particles and their neighbours (see Figure 7). All tested curves result in similar distributions. For the Morton [Mor66] and Hilbert [Hil91] curves, however, approximately 87% of the differences lie in the range [0, 256). As this fraction is higher than the corresponding fraction for the XYZ curve, the Morton and Hilbert curves should lead to improved compression rates. As the Morton curve has an improved overall performance compared to the Hilbert curve, we propose to sort space cells, i.e. particles, according to the Morton curve.

Compression scheme
We compare the compression ratio of our scheme to existing variants within our CLL approach. We also measure the time to search and store all particle neighbours and the total time to compute a simulation step. We particularly compare to the uncompressed case to illustrate the memory saving, but also the computing overhead for encoding and decoding of the neighbour lists. The measurements in Table 3 show that the memory requirement for a neighbour index can be reduced from 4 B in the uncompressed case to 0.851 B using our compression scheme. This corresponds to an average memory saving of approximately 79%. In particular, our variant requires less memory than Stream VByte. The encoding and decoding add only a marginal overhead to the overall computation time of a simulation step compared to the uncompressed case. Figure 8 illustrates that the average size of a neighbour set can be reduced from 120 B in the uncompressed case to 25 B using the proposed scheme.
Our compression scheme yields the best compression rate. This is due to the proposed handling of small differences between two consecutive neighbour indices. On the other hand, the cost for storing larger differences is higher with our compression scheme compared to Stream VByte. As large differences are only an exceptional case in our SPH fluid setting, our proposed compression method outperforms Stream VByte. Although fewer bytes are read from memory, our experiments indicate that decoding the neighbour list is associated with a small performance overhead.

CLL approach
In Table 4, we compare the performance and memory consumption of our CLL variant with compact hashing as proposed in [IABT11]. Our CLL approach requires significantly less memory than compact hashing. Building and querying the respective data structures to estimate and store the neighbour lists is also faster with the proposed compacted cell array and search algorithm.

Scaling with the number of neighbours
The support length s h with s ∈ ℝ+ governs the number of particle neighbours. For example, increasing the support length from 2 h to 3 h increases the number of neighbours per particle from approximately 30 to 105. While larger supports are interesting in terms of simulation accuracy, their usage is typically limited by the significant increase in memory consumption. Our compressed neighbour lists alleviate this issue. Table 5 shows performance and memory measurements for varying supports. It can be seen that the compression ratio improves for a growing support. These measurements illustrate that our compressed neighbour lists and the entire CLL approach are particularly useful for SPH simulations with larger supports. This is also emphasized in Figure 9, where memory requirements for neighbour lists are shown for varying supports. It can be seen that, e.g. compressed neighbour lists for a support of 3.5 h with 184 neighbours require less memory than uncompressed neighbour lists for a support of 2 h with 30 neighbours as indicated in Figure 8.

Scaling with the number of particles

Table 6 shows how our CLL approach scales with the number of particles. To this end, we use the dam break scenario with varying resolutions resulting in 0.2, 11.6, 206.2 and 7440.9 million particles. We also use the wind tunnel scenario with 1302.1 million particles. Memory consumption and computation times for the neighbour lists and for entire simulation steps are given for three different hardware settings. We discuss three special settings.

Interactive SPH simulation on a 12-core PC: We are able to compute 25 simulation steps per real-time second for 0.2 million particles. DFSPH is used with two divergence-free iterations and two density-invariance iterations.
Large-scale SPH simulations on a single workstation: Our neighbour search and storage enables the simulation of 206.2 million SPH particles on a 12-core PC using only 28 GB of memory. Even more remarkable, 1302.1 million SPH particles can be processed on a 24-core PC using 172 GB of memory. Such particle numbers have previously required large GPU or CPU clusters.

Large-scale SPH simulations on multiple workstations: By connecting multiple workstations, we are able to simulate the dam break scenario with a resolution that results in 7440.9 million particles. This setting requires 872 GB of memory without linear-exact kernel gradients. The performance is limited by the speed of our network and the unequally distributed amount of available memory per workstation. Nevertheless, we can simulate one step of the 200 million particle dam break scene in 15.254 s in this setting rather than 55.857 s using one 12-core PC.
Compressing the neighbour lists is particularly appropriate for scenes with large particle numbers that capture small-scale details. This is illustrated in the wind tunnel scenario in Figure 10, which is simulated on our 112-core setup. The total computation time per simulation step is 80.17 s on average, of which 8.95 s are spent on searching and storing the neighbour lists. A maximum of 210.8 GB of memory is required for the simulation.

Conclusion and Future Work
Neighbour computation and storage is often considered a critical aspect in SPH simulations. This paper contributes improved data structures and algorithms to this topic with a specific focus on memory consumption. As neighbour lists are responsible for a large portion of the overall memory consumption, we propose to compress these lists with a novel scheme. Accompanied by a novel memory-efficient CLL variant, the overall memory consumption can be significantly reduced, e.g. by a factor of nine compared to the state-of-the-art as indicated in Table 4. The lean data structure of our CLL enables a straightforward implementation that is fully parallelized. While all scenarios generally require less memory with marginal computational overhead, it is particularly remarkable that SPH fluid simulations with about one billion particles can now be processed on a single 24-core PC using only 172 GB of memory. Such scenarios have rarely been computed previously, as they required larger GPU or CPU clusters.
GPU implementations are beyond the scope of this paper, but might be an interesting direction for future research. Since we are able to process 0.2 million particles with 25 simulation steps per real-time second on a 12-core CPU, it is certainly interesting to investigate the respective GPU capabilities. Another interesting research direction might be the reconsideration of VL. Such lists are updated only every few simulation steps, but are much larger than the lists of actual neighbours. As our compression scheme is particularly useful for larger lists, it can be interesting to investigate the potential of compressed VL.


References
[FE08] FLEISSNER F., EBERHARD P.: Parallel load-balanced simulation for short-range interaction particle methods with hierarchical particle grouping based on orthogonal recursive bisection. International Journal for Numerical Methods in Engineering 74, 4 (2008), 531-553.