Parallel CPU–GPU computing technique for discrete element method

The efficiency of simulations with the discrete element method (DEM) is significantly improved using a novel computational strategy. The new method is developed with a focus on platforms equipped with multi-core central processing units (CPU) and general-purpose graphics processing units (GPU). The DEM calculations are performed in parallel on the CPU and on the GPU using pre-calculated Verlet lists with a posteriori analysis of their consistency. The operations related to the search for possible contacts are performed on the CPU, whereas the processing of interactions and the integration of motion are executed on the GPU. Performance analysis for various types of tasks has shown that the new method significantly decreases the average computational time and utilizes the available computational resources more efficiently than the sequential CPU–GPU execution mode. Furthermore, due to the more efficient calculations, the overall energy requirement of the proposed strategy does not exceed the demand of conventional sequential CPU–GPU computations.

An effective way to accelerate DEM simulations is to parallelize the computations. The relative simplicity of this approach stems from the nature of the DEM: some of the basic operations, namely the calculation of forces between each pair of objects and the integration of motion for each object, are independent and can therefore easily be performed concurrently for each object or pair of objects.
Initially, the parallelization strategies in open-source and commercial DEM software packages focused solely on central processing unit (CPU) implementations. To speed up calculations, a static or dynamic domain decomposition was applied, and the calculations were executed on multiprocessor or multi-core systems. For this purpose, the message passing interface (MPI), the OpenMP interface, or hybrid approaches 17 were applied.
However, due to the increase in the computational power of graphics processing units (GPU) and the development of application programming interfaces such as CUDA or OpenCL, more and more GPU-based DEM algorithms and software environments are emerging. For example, He et al. 18 have proposed a GPU-based implementation of a multi-grid algorithm for contact detection and applied it to the modeling of the compaction of particles with a wide size distribution. Kureck et al. 15 analyzed the behavior of non-spherical tablets during coating. To model large-scale systems, combined MPI- and GPU-based calculations were applied by different groups. 14,16,19 One of the challenges of DEM simulations on the GPU is the efficient utilization of the available computing resources. During the calculations on the GPU, the CPU remains idle, and powerful multi-core processors are not used. To improve the efficiency of calculations, cooperative CPU-GPU computing techniques can be applied. Lee et al. 20 proposed a cooperative heterogeneous computing paradigm in which, depending on the runtime workload distribution, a part of the work initially assigned to the GPU is executed on the CPU. Wrede and Ernsting 21 developed a library for simultaneous CPU-GPU computing for data-parallel skeletons. Navarro et al. 22 developed a partitioning strategy for parallel loops and applied it to particle simulations. A more comprehensive survey of other heterogeneous computing techniques and their applications can be found in Mittal and Vetter 23 and in Raju and Chiplunkar. 24 Nowadays, in almost all GPU implementations of the DEM, most of the calculations, including time-consuming contact detection, 14,18,25 are executed on the GPU. In some rare cases, hybrid CPU-GPU computing is performed, and a part of the operations is transferred to the CPU. 9,10 Nonetheless, the DEM calculations are performed in a pipeline mode either on the CPU or on the GPU.
In this contribution, a new strategy for the concurrent usage of the CPU and GPU in DEM simulations is proposed. In contrast to previously used techniques, 20,21 this method is based on a high-level algorithmic parallelization of the calculations.

The main principle of the DEM
The DEM is an explicit mesh-free approach for modeling granular materials. Each iteration of the DEM can be subdivided into three main steps:
• Step 1: contact detection;
• Step 2: calculation of interactions;
• Step 3: integration of motion.
In the first step, the positions of all objects (particles, droplets, walls, etc.) are analyzed, and interacting objects are identified. Due to the large number of modeled objects, this can be a very time-consuming task. Additional complexity arises when it is necessary to treat non-spherical particles or particles with a large size ratio, 26 or to consider periodic boundary conditions. Moreover, contact detection between a particle and a wall involves additional complexity, since it is necessary to distinguish between three different types of contacts and to perform their post-processing. 27 In the second step, the interactions between objects are calculated. There is a wide variety of contact models, ranging from relatively simple linear interaction models consisting of several multiplications and summations to complex non-linear models. 8 Compared to contact detection and the calculation of interactions, the integration of motion in the third step is a relatively simple operation: here, new velocities and positions of the objects are calculated.
It should be noted that the three steps listed above constitute the most commonly used DEM scheme. However, depending on the type of problem being solved, additional steps can be included. For example, during the simulation of a breakage process, the positioning of fragments must be performed. 28,29
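These three steps can be illustrated with a minimal sketch (not MUSEN code; a 1D linear spring model with explicit Euler integration, all names hypothetical) that performs contact detection, force calculation, and integration in a single iteration:

```cpp
#include <cmath>
#include <vector>

// Minimal sketch of one DEM iteration. Spheres on a 1D axis for brevity;
// a real implementation works with 3D vectors and richer contact models.
struct Particle { double x, v, f, r, m; }; // position, velocity, force, radius, mass

void demStep(std::vector<Particle>& ps, double k, double dt) {
    for (auto& p : ps) p.f = 0.0;
    // Steps 1+2: contact detection and a linear (spring) interaction model
    for (size_t i = 0; i < ps.size(); ++i)
        for (size_t j = i + 1; j < ps.size(); ++j) {
            double dist    = std::fabs(ps[i].x - ps[j].x);
            double overlap = ps[i].r + ps[j].r - dist;
            if (overlap > 0.0) {                // contact is active
                double dir = ps[i].x < ps[j].x ? -1.0 : 1.0;
                ps[i].f += k * overlap * dir;   // repulsive spring force
                ps[j].f -= k * overlap * dir;
            }
        }
    // Step 3: explicit integration of motion
    for (auto& p : ps) {
        p.v += p.f / p.m * dt;
        p.x += p.v * dt;
    }
}
```

In a real DEM code, the pairwise loop in steps 1 and 2 is replaced by a neighbor search such as the Verlet-list approach described below, since the naive O(N²) pair check quickly becomes the bottleneck.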

Contact detection algorithm
The novel parallelization strategy has been implemented in the component-based modeling framework MUSEN. 30,31 To perform contact detection, it applies a method based on Verlet lists. 32 This is a widely used approach in particle-based simulations. 33,34 Its main idea is as follows: for each particle P_i, a list of all neighboring particles or walls located within a certain cut-off distance (Verlet distance) L_v is found and stored. Thereafter, to detect all contacts of P_i, it is enough to analyze only the potential contacts from this list. Moreover, there is no need to update the list in each simulation time step, which reduces the amount of calculations. The condition for including a potential contact between particles P_i and P_j in the list is

‖x_i(t_v) − x_j(t_v)‖ ≤ R_i + R_j + L_v,   (1)

where R_i and R_j are the contact radii of the particles, and x_i(t_v) and x_j(t_v) are their positions at the time point t_v when the Verlet list is updated. The list of potential contacts is considered consistent at any time point t > t_v if the following condition is satisfied for all particles:

‖x_i(t) − x_i(t_v)‖ ≤ L_v / 2.   (2)

The frequency of updating the Verlet list strongly depends on the dynamics of the objects and on the threshold value L_v. On the one hand, small values of L_v reduce the size of the Verlet lists; on the other hand, increasing L_v decreases the updating frequency. Moreover, an additional limiting factor is the amount of available global GPU memory. To avoid time-consuming memory reallocation on the GPU in each iteration, memory is allocated once for all potential contacts each time the Verlet list is updated. As a result, for very large systems, L_v is limited by the amount of available GPU memory. In the MUSEN framework, the default value of L_v is twice the minimum particle radius in the system.
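The two Verlet-list conditions can be sketched as follows (an illustrative 1D version, not the MUSEN implementation; names are hypothetical, and the consistency test assumes the common per-particle displacement bound of L_v/2, which guarantees that no pair can approach by more than L_v in total):

```cpp
#include <cmath>
#include <utility>
#include <vector>

struct Sphere { double x; double r; }; // position and contact radius

// Condition (1): store the pair if the gap at list-update time t_v
// is not larger than the Verlet distance Lv.
std::vector<std::pair<size_t, size_t>>
buildVerletList(const std::vector<Sphere>& ps, double Lv) {
    std::vector<std::pair<size_t, size_t>> list;
    for (size_t i = 0; i < ps.size(); ++i)
        for (size_t j = i + 1; j < ps.size(); ++j)
            if (std::fabs(ps[i].x - ps[j].x) <= ps[i].r + ps[j].r + Lv)
                list.emplace_back(i, j);
    return list;
}

// Condition (2): the list stays consistent while no particle has moved
// farther than Lv/2 from its position at t_v.
bool isConsistent(const std::vector<double>& xAtTv,
                  const std::vector<Sphere>& ps, double Lv) {
    for (size_t i = 0; i < ps.size(); ++i)
        if (std::fabs(ps[i].x - xAtTv[i]) > 0.5 * Lv) return false;
    return true;
}
```

Once isConsistent returns false, the list must be rebuilt from the current positions; until then, only the stored pairs need to be examined in each time step.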
To reduce computational costs and to avoid the analysis of all possible pairs of objects, the linked-cell algorithm 35 was also applied. The whole simulation domain is divided into cubic cells of a known size, and all particles are distributed among them according to their positions. The cell size L_C is defined depending on the maximum particle radius R_max and should be greater than or equal to 2R_max + L_v. Due to this, to find all possible contacts of P_i, it is enough to consider only the objects in the current and all adjacent cells. Moreover, to avoid duplicate comparison of the same cell pairs, only a limited set of combinations is processed, which reduces the number of cell pairs to be considered from 27 to only 14. 36 The efficiency of the linked-cell algorithm can be negatively affected by the presence of particles with wide size distributions. Thus, to improve this method, it was extended by a multigrid approach. 26,37 Its main idea is to omit the evaluation of contacts between small particles on the coarse grids. The method implies the introduction of several grid levels j with cell sizes L_c^j varying from 2R_max + L_v to 2R_min + L_v. On each level j, only particles with sizes within the limits U_max^j and U_min^j are considered, where U_max^j = (L_c^j − L_v)/2 and U_min^j is defined in the same way from the cell size of the next finer level. Each cell i on each level j maintains two lists of particles: the main list M_i^j for particles with U_min^j < R_i ≤ U_max^j and the secondary one S_i^j for particles with R_i ≤ U_min^j. Then, considering two neighboring cells A and B, Equation (1) is evaluated for pairs (P_a, P_b) with P_a ∈ M_A^j and P_b ∈ M_B^j ∪ S_B^j. Thus, at each level, only contacts between pairs of "large-large" and "large-small" particles are processed. All "small-small" contacts are processed at the lower levels of the hierarchy with more suitable grid sizes.
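The binning step of the linked-cell algorithm can be sketched as follows (a simplified single-level version with hypothetical names; the cell index is obtained by integer division of the coordinates by the cell size, so that, with L_C ≥ 2R_max + L_v, condition (1) can only hold for particles in the same or in adjacent cells):

```cpp
#include <array>
#include <cmath>
#include <map>
#include <vector>

// Sketch of linked-cell binning (illustrative, not the MUSEN code).
using Cell = std::array<int, 3>;

// Integer cell coordinates of a point for a given cell size Lc.
Cell cellOf(const std::array<double, 3>& x, double Lc) {
    return { static_cast<int>(std::floor(x[0] / Lc)),
             static_cast<int>(std::floor(x[1] / Lc)),
             static_cast<int>(std::floor(x[2] / Lc)) };
}

// Distribute all particles among the cells according to their positions.
std::map<Cell, std::vector<size_t>>
binParticles(const std::vector<std::array<double, 3>>& xs, double Lc) {
    std::map<Cell, std::vector<size_t>> cells;
    for (size_t i = 0; i < xs.size(); ++i)
        cells[cellOf(xs[i], Lc)].push_back(i);
    return cells;
}
```

A neighbor search then only visits each cell and a fixed half-shell of its 26 neighbors (13 offsets plus the cell itself, i.e., the 14 combinations mentioned above), so every cell pair is examined exactly once.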
A general idea of the particle contact detection algorithm is presented in Algorithm 1.

Parallel CPU-GPU calculations
The contact detection using Verlet lists (Algorithm 1) involves intensive dynamic memory allocations as well as a large number of heterogeneous operations. Both aspects are critical for GPU execution and may lead to a significant reduction in computing performance. It should be noted that there exist alternative strategies for contact detection on the GPU without the generation of Verlet lists. 14,16,25 However, Verlet lists offer a set of significant advantages, and in this work, we propose a new simulation strategy based on their use.
The conventional way to perform Verlet-based DEM simulations on a hybrid CPU-GPU architecture is to perform the calculations sequentially, as shown in Figure 1A. Here, the detection of potential contacts and the generation of Verlet lists are performed on the CPU (A1). The generated lists are transferred to the GPU (A2) as fixed-size arrays, which allows avoiding any memory reallocation on the GPU. In the further steps, these lists are used for the iterative calculation of interactions (A4) and the integration of motion. These calculations are repeated as long as the condition in Equation (2) is satisfied; afterwards, the Verlet lists are regenerated on the CPU.
As can be seen from Figure 1A, there are idle intervals when the GPU is in standby mode, waiting for new lists to be generated on the CPU. As a result, the efficiency of the sequential approach depends on the average clock time τ_CPU needed to regenerate the Verlet lists (A1). If τ_CPU is much shorter than the average time τ_GPU needed to perform the DEM calculations on the GPU (A4), the sequential execution approach reveals high efficiency. However, if τ_CPU and τ_GPU have the same order of magnitude, the fraction of idle intervals increases and hence the overall efficiency decreases.
To improve calculation performance, the parallel execution mode (Figure 1B) with two-stage instruction pipelining and task-level parallelism is proposed. The whole computational work is divided into two stages, which can be performed relatively independently: pre-calculating contacts in Verlet lists and using them to calculate the forces and motion of objects. In this case, the CPU-GPU system can be considered a two-stage pipeline, where the first instruction is to calculate the Verlet list on the CPU, and the second one is to use it on the GPU. The pipelined processing of commands maximizes the load of both computational units over the entire simulation time. Such a static distribution of tasks between the two devices eliminates complex synchronization concepts, making the algorithm more general and applicable to a wide range of problems.
The main idea of the algorithm is to start the pre-calculation of new Verlet lists on the CPU (B5) even before the DEM calculations with the current list (B6,7) are finished. During the execution of the algorithm, the initial Verlet list Λ_1 is first calculated on the CPU (B1), while the GPU is idle. While the GPU then performs the DEM calculations with Λ_1 (B6,7), the next list Λ_2 is already pre-calculated on the CPU (B5). By the time these calculations are finished, the pre-calculated Λ_2 may already be inconsistent with the new coordinates of the objects. Therefore, after the data transfer (B9), an additional analysis is needed to verify whether Equation (2) is met for Λ_2 (B10). If Λ_2 is not consistent, the semi-parallel stage of the algorithm (B1-10) starts from the beginning, and the calculation of the new Verlet list is based on the current positions of the objects (B1).
On the other hand, if Λ_2 is still valid, the parallel stage of the algorithm starts. Λ_2 is transferred to the GPU (B11), where the calculations proceed using Λ_2 (B13,14). In parallel, the CPU starts the calculation of the next list Λ* (B12), so (B12) and (B13,14) are executed concurrently. Here, one should note that the Verlet list Λ_2 was calculated for the state of the system in the middle of the calculations on Λ_1 (moment (B4) in the diagram). Thus, by the time of switching to the list Λ_2 (B9-11), the objects may already have traveled some distance relative to their positions at point (B4). Because of this, condition (B13) may cease to be fulfilled earlier than it did for (B6,7), which means that the GPU calculations in the parallel stage (B13,14) of the algorithm may be shorter than in the semi-parallel stage (B6,7). If the average speed of the objects does not change over time, the difference between them is about a factor of 2.
Once the calculations using Λ_2 are finished, synchronization (B15) and data transfer (B16) occur. After that, condition (B10) is checked again and, depending on the result, either the semi-parallel (B1-10) or the parallel (B10-16) part of the algorithm is repeated. It should be noted that steps (B8,9) and (B15,16) are equivalent; the latter were added to the diagram only for better readability.
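The two-stage pipelining idea can be sketched with std::thread as follows (a schematic stand-in, not the MUSEN code: the "GPU" stage is simulated by an ordinary CPU function, the consistency check is omitted, and all names are hypothetical):

```cpp
#include <thread>
#include <utility>
#include <vector>

struct VerletList { std::vector<int> pairs; };

// Stage 1 (CPU): pre-calculate a list; the stamp stands in for real content.
VerletList computeList(int stamp) {
    return VerletList{ std::vector<int>(3, stamp) };
}

// Stage 2 ("GPU" stand-in): consume the current list.
int integrateWithList(const VerletList& l) {
    int work = 0;
    for (int p : l.pairs) work += p; // pretend DEM work
    return work;
}

int runPipelined(int nStages) {
    int total = 0;
    VerletList current = computeList(0);    // semi-parallel start (B1)
    for (int s = 1; s <= nStages; ++s) {
        VerletList next;
        std::thread producer([&next, s] { next = computeList(s); }); // B12
        total += integrateWithList(current); // B13,14 runs concurrently
        producer.join();                     // synchronization (B15)
        current = std::move(next);           // switch to the new list
    }
    total += integrateWithList(current);
    return total;
}
```

The key point is that the producer thread (the next Verlet list) and the consumer (the current DEM step) overlap in time, exactly as stages (B12) and (B13,14) do in Figure 1B.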
As mentioned above, the efficiency of one or the other execution mode depends on the average time needed for the contact detection on the CPU (τ_CPU) and for the remaining part of the DEM operations on the GPU (τ_GPU). Both of these characteristics are influenced by many factors, and their a priori estimation is a very challenging task. They are highly dependent on the hardware resources, on the type of problem being solved, and even on the state of the simulated scene at any given moment. Let q be the ratio

q = τ_CPU^seq / τ_GPU^seq,   (5)

where τ_CPU^seq and τ_GPU^seq are the average clock times needed for the CPU and GPU calculations in the sequential execution mode (Figure 1A). As stated earlier, in the ideal case, the number of GPU simulation time steps in each parallel stage will be 2 times smaller than in the sequential execution mode. As a result, two cases can be distinguished:
1. q ≤ 0.5: the parallel execution mode fully loads the GPU;
2. q > 0.5: the parallel execution mode fully loads the CPU.
The differences between these two cases are demonstrated in Figure 2, where exemplary time diagrams for both calculation modes and for two different hardware configurations are shown schematically. It is assumed that, independent of the current state of the system, the following conditions are met: the time needed to generate a Verlet list on the CPU (τ_CPU) is constant, and the GPU calculation time is directly proportional to the Verlet distance L_v. For clarity, the diagrams show the case when the objects move at a constant speed; then τ_GPU in the parallel execution mode is reduced by exactly a factor of 2 compared to the sequential mode. In real simulations, this reduction will depend on the dynamics of the process. For the assumed ideal case, τ_CPU (B12) does not depend on the algorithm and therefore remains constant. Moreover, to simplify the representation, the time required for data transfer and for other supplementary operations is not included.
The first two timelines in Figure 2 depict the differences between the sequential and the parallel approach for the case q = 0.25. The calculations start on the CPU with the generation of the first Verlet list Λ_1, which is then transferred to the GPU (operations A1 and A2 in Figure 1A). Since the object positions from the middle of the previous GPU cycle are used to calculate the current list of possible contacts Λ_2, the current GPU calculation cycle will be shorter than the initial one. When both devices complete their iterations, the algorithm repeats. It can be observed that by applying this algorithm (q < 0.5), one can ensure that the GPU is never idle but is constantly busy with computing operations. If during the data transfer (B9,11) it turns out that the condition from Equation (2) is not satisfied, the timing diagram resets to the initial state.
If q = 0.75, the sequential approach works the same as in the previous case, but the overall simulation time is longer since the detection of contacts takes more time (Figure 2). The main difference in the parallel strategy is that the calculation of the new Verlet list (B5) cannot be finished before the end of the integration step on the GPU (B6,7). Thus, only a partial overlap of the activity periods of the CPU and the GPU is possible. The GPU idle intervals remain, but their size is reduced. Nevertheless, the simulation time is reduced compared to the sequential approach, since the CPU idle time is eliminated.
The performance gain G from using the parallel execution mode is defined as

G = (τ_seq − τ_par) / τ_par,   (6)

where τ_seq = τ_CPU^seq + τ_GPU^seq + τ_exch^seq is the execution time of one CPU-GPU stage in the sequential mode and τ_par is the time required to calculate the same number of simulation steps in the parallel mode. Using Equation (6), the achieved gain can also be calculated from the total measured calculation times of the two modes (Equation 7). Since the semi-parallel stage usually needs to be performed very rarely, it is neglected here. If τ_exch ≪ τ_CPU + τ_GPU, the time needed for data exchange can also be neglected. Then, using Equation (5), the maximum theoretical gain G_max, which can be reached by migrating from the sequential to the parallel execution mode, can be estimated as

G_max = q for q ≤ 0.5;  G_max = (1 − q) / (2q) for q > 0.5.   (9)

Thus, even under ideal conditions, the maximum theoretical gain is 50%, and it is reached at q = 0.5. The achieved G must decrease with an increase in τ_exch and with any deviation of q from 0.5. Moreover, the performance gain even turns negative if q > 1, meaning that the algorithm loses its effectiveness if τ_CPU^seq > τ_GPU^seq. It is worth noting that q, despite depending on the case being simulated, is not a constant intrinsic characteristic of that case. It depends on many factors and changes over time during the simulation. Therefore, q, and hence G and G_max, are defined for only one CPU-GPU stage and are constantly changing. Nevertheless, in real calculations, G_max can be measured (possibly averaged over a period of time) and effectively used as a criterion for the applicability of the algorithm in each specific case at any time point of the simulation.

Parallel calculations on CPU
Almost all calculation steps of the main algorithm depicted in Figure 1 are parallelized on the CPU using a thread pool. Applied to the proposed algorithm, two main operations during the calculation of the Verlet lists are performed in parallel using the ThreadPool: the distribution of particles over the grids and the contact detection between particles in adjacent grid cells.
The Verlet lists Λ_2 and Λ* must be calculated on the CPU in parallel with the GPU calculations. Since the function that evaluates the list does not need to return a value, std::thread is used to execute it. When the calculation has to be started, a std::thread object is created with a callable (the function calculating the Verlet list), associated with a thread of execution, and immediately started asynchronously with respect to the GPU.

Parallel calculations on GPU
The Verlet list, calculated on the CPU, contains potential contacts that can take place but are not necessarily active at the current time point.
During the simulation (Figure 1: A4, B7, B14), the number of active contacts constantly changes. However, to avoid frequent memory reallocations, a predefined data structure with the information about collisions is created for each potential contact. Regardless of whether a particular contact takes place or not, this structure is maintained until the next recalculation of the Verlet list. The structure consists of:
• pre-calculated parameters: parameters that do not change during the contact, such as the equivalent Young's modulus;
• incremental values: values calculated in previous steps that must be maintained for further calculations, such as the tangential overlap;
• results: data fields in which simulation results are stored.
When the data structures with the possible contacts are generated, the calculation of interactions and the integration of motion start. Here, three consecutive operations are performed in each step:
• active contacts detection: the Verlet list with all potential contacts is analyzed and, depending on the current object coordinates, a shorter list of currently active collisions is created;
• calculation of particle-particle and particle-wall interactions: for each contact, the corresponding contact model is executed;
• calculation of solid or liquid bonds: the corresponding contact model is executed for each bond.
The GPU calculations are implemented using the CUDA computing platform. For the parallel execution of each calculation step, a constant number of blocks and threads per block is specified to run the computational kernels. Both values are usually defined depending on the GPU architecture. For all GPU configurations used in this study, the number of blocks was set equal to the number of multiprocessors available on the device (cudaDeviceProp::multiProcessorCount), and the number of threads per block was set to 256.
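The first of these operations, active contact detection, can be sketched as follows (an illustrative serial CPU version of what the CUDA kernel does in parallel, with 1D positions and hypothetical names):

```cpp
#include <cmath>
#include <utility>
#include <vector>

struct Body { double x, r; }; // position and contact radius

// Filter the fixed-size Verlet list of potential contacts down to the
// pairs that currently overlap; the returned indices point back into the
// per-contact data structures, which keep their positions between rebuilds.
std::vector<size_t> activeContacts(
        const std::vector<std::pair<size_t, size_t>>& verlet,
        const std::vector<Body>& bodies) {
    std::vector<size_t> active;
    for (size_t c = 0; c < verlet.size(); ++c) {
        const Body& a = bodies[verlet[c].first];
        const Body& b = bodies[verlet[c].second];
        if (std::fabs(a.x - b.x) < a.r + b.r) // overlap -> contact is active
            active.push_back(c);              // index into the fixed structure
    }
    return active;
}
```

Keeping the indices stable between Verlet-list rebuilds is what allows the incremental values (such as the tangential overlap) to survive from one time step to the next without any reallocation.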
All kernels are launched on a single stream and therefore execute in order on the GPU. The data copy operations between the compute units (cudaMemcpy) are synchronous from the point of view of the main CPU thread. To organize the data exchange, pinned memory (allocated with cudaMallocHost) is used.

Investigated case studies
To analyze the efficiency of the new approach, several DEM simulations have been performed; overall, 12 different case studies have been analyzed. In the first case study, the dynamics of grinding balls and grinding medium in a ball mill was simulated; in the last decades, the DEM has been widely applied to this type of apparatus. 38 In the second case study, the behavior of a material in a high-shear granulator 39 is investigated. Case 3 represents the dynamic impact of a large particle on a granular bed of loose particles. Three different types of mixers are simulated in case studies 4, 5, and 6. A detailed description of scenes 4 and 5 can be found in Dosta et al., 40 Lee et al., 41 and McGuire et al. 3 In Case 6, the free-fall motion of particles in a static mixer is modeled. Cases 7 and 8 investigate the process of silo emptying: in Case 7, spherical particles are modeled, and in Case 8, half-spheres. To represent the non-spherical particle shape, the bonded-particle model (BPM) was used, 8 where each particle of complex shape is represented as a set of primary particles connected by solid bonds. In Case 9, the vertical vibrations of rod-like particles in a cylindrical container are analyzed; here, the BPM is also used to represent the shapes of the objects. The three-point bending test applied to biopolymer aerogels is modeled in case study 10. 6 In Case 11, the random motion of particles in a cubic volume with periodic boundary conditions 42 was studied. The last case study, 12, simulates the filling of a conical container with particles under the influence of gravity.
All investigated case studies consist of a relatively large number of discrete objects, on the order of 10^5 and more. Using the developed parallel CPU-GPU strategy, or GPU computing in general, for scenes with a small number of particles is inefficient: simulations consisting of fewer than 10^4 objects can usually be calculated more effectively on a multi-core CPU than on a GPU.

Performance analysis
To get reliable statistics, the case studies shown in Figure 3 have been modeled for a relatively long interval of process time on five different hardware configurations. The simulation end time has been chosen so that the calculation clock time on HD1 was at least 1 hour.
Each simulation has been repeated 10 times on each configuration. In the scope of this work, the following hardware was used for running the case studies:
• HD1: Intel Core i7-7700K, NVidia GeForce GTX 1080 Ti
All CPUs and GPUs were operated at their reference frequencies in all configurations; all GPUs were connected via a PCIe 3.0 ×16 interface.
To carry out the measurements described below, the source code was instrumented to measure the execution time of specific code sections using the std::chrono::steady_clock::now() function from the C++ standard library. The obtained results generally indicate that the new algorithm makes it possible to significantly improve the computational performance. However, the improvement is highly dependent on various conditions, and in some cases, the new method can even lead to a performance drop. Therefore, a prerequisite for its efficient use is online time measurement and automatic switching between the sequential and parallel execution modes.
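Such instrumentation can be sketched as a small helper around std::chrono::steady_clock (illustrative, not the MUSEN code; steady_clock is the appropriate choice here because it is monotonic and unaffected by system clock adjustments):

```cpp
#include <chrono>

// Measure the wall-clock duration of a code section in milliseconds.
template <class F>
double measureMs(F&& section) {
    const auto t0 = std::chrono::steady_clock::now();
    section(); // code section under measurement
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

In an online-switching scheme, such measurements of τ_CPU and τ_GPU would be accumulated over a few stages and used to decide at runtime whether the sequential or the parallel mode is preferable.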
To verify the effectiveness of the implementation of the proposed algorithm, the obtained performance gain for all simulated problems, calculated according to Equation (7) on each hardware architecture, was compared to the maximum theoretical gain (Equation 9). To calculate G, the total calculation times in the sequential and parallel execution modes were used. The time parameters τ_CPU and τ_GPU, needed to calculate q and G_max, were measured and averaged for each case and hardware configuration. The obtained results are shown in Figure 5, which compares the obtained average performance gain with the maximum possible gain. Each point on the diagram represents the averaged result of the same 10 simulations (as in Figure 4) of a particular test case on a given hardware configuration.
As the ideal case, G = G max is assumed. The data demonstrate that the current implementation of the proposed algorithm reveals high efficiency.
The average deviation of the obtained performance gain from the calculated G_max is 3.56%. Averaging the results by case gives a maximum deviation of 8.91% for Case 2; averaging by hardware configuration gives a maximum of 5.06% for HD4; the absolute maximum is 13.61% (Case 5 on HD4).
For all cases, the averaged measured value of q (Equation 5) was less than 0.5; therefore, the equality G_max = q holds for all results in the diagram.
Overall, there are two main conclusions from the results. First, the current implementation does not contain major flaws but has potential for improvement. Second, the proposed equation for the approximate calculation of G max (9) is suitable for a rough estimation of the possible performance gain from the application of the new algorithm.
The transfer time was not considered when calculating the theoretical benefit. However, this parameter can have a significant impact on the performance of the algorithm; therefore, its influence was additionally investigated during the simulation of all case studies. It was found that the new method almost doubles the number of data exchange operations (CPU to GPU and back): from 3094 to 6173 on average over all case studies. This growth is expected and is associated with the twofold reduction of the GPU cycle. In general, one can observe a tendency that the new simulation strategy is only meaningful for DEM scenes with a relatively large number of modeled objects. In Figure 6, the influence of the number of objects on the performance gain is shown. According to the obtained performance gain, it is proposed to distinguish four main groups: no improvement (≤ 1%), noticeable improvement (from 1% to 5%), moderate improvement (from 5% to 15%), and significant improvement (≥ 15%). The majority of the simulated DEM scenes exhibit a noticeable or better improvement, with four cases gaining more than 5% performance. As can be seen from the results, there is a general trend towards an increase in the obtained performance gain with an increasing number of modeled objects.
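The proposed grouping can be expressed as a simple classification function (illustrative only; the handling of the boundaries at exactly 1%, 5%, and 15% follows the thresholds given above):

```cpp
#include <string>

// Map a measured performance gain (in percent) to one of the four
// improvement groups proposed in the text.
std::string gainGroup(double gainPercent) {
    if (gainPercent <= 1.0) return "no improvement";
    if (gainPercent < 5.0)  return "noticeable improvement";
    if (gainPercent < 15.0) return "moderate improvement";
    return "significant improvement";
}
```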
One of the main factors here is the constant threshold distance L_v (Equation 1), which for all case studies, except Cases 5 and 7, was equal to the diameter of the smallest particle. It is expected that an enhancement of this approach with an auto-adjustable threshold, calculated at runtime, can lead to a significant increase in efficiency.
F I G U R E 6 Average performance gain versus the total number of modeled objects
To further explore the differences in performance gains between test cases from different groups, time diagrams for Case 1 (noticeable improvement) and Case 6 (significant improvement) are provided. The diagrams shown in Figures 7 and 8 were obtained using the NVIDIA Nsight Systems 2020.4.3.7 profiling tool. The sampling rate of the process tree was set to the default value of 1 kHz.
Case 1 was sampled over 1.2 ms of process time (Figure 7). During this interval, 11 calculations of Verlet lists took place in the sequential execution mode, and the average pause in the work of the GPU was about 145 ms. With the proposed parallel algorithm, these short idle intervals were almost eliminated. As expected, the number of Verlet list calculations nearly doubled, to 21. It can also be seen that the impact of the data exchange was insignificant.
Case 6 was sampled while calculating 2 ms of process time (Figure 8). When comparing the timeline for Case 1 to that for Case 6, the difference in performance gain becomes apparent. In the sequential mode, there were 16 calculations of Verlet lists, each lasting about 3.13 s. The entire simulation lasted 130 s, which means that the GPU was idle for more than a third of the whole time. Therefore, the application of the parallel algorithm was much more efficient here. As in the previous case, the number of Verlet list calculations performed on the CPU roughly doubled, and the GPU downtime was reduced by more than a factor of four.
It is thus expected that the use of the concurrent strategy leads to an increase in energy consumption. To check this, the change in the power consumption of a computer with the HD4 configuration was measured during the transition from the sequential to the parallel mode. The measurements were performed with the electricity consumption meter Voltcraft SEM4500. Each test scene was simulated for 1 hour of real time, during which the total energy consumption of the system was measured. The results of these studies are summarized in Figure 9. As expected, the power consumption in the parallel mode increases due to two factors: the more frequent recalculation of Verlet lists on the CPU and the shorter idle time of the GPU. It can also be seen that, in general, an increase in calculation performance directly correlates with an increase in energy consumption.
Nevertheless, due to the achieved performance gain, the overall energy requirements of the new strategy do not exceed, and are in many cases even lower than, the consumption of the sequential mode: the new approach is more energy efficient in 75% of the cases studied. The scenes with the worst energy efficiency are at the same time the cases with the worst performance gain, which again indicates that the new approach is not applicable to these simulations.
F I G U R E 7 Timeline diagram for Case 1 on HD1 for sequential and parallel execution modes
F I G U R E 8 Timeline diagram for Case 6 on HD1 for sequential and parallel execution modes
F I G U R E 9 Dependence of the increase in power consumption on the obtained performance gain for the hardware configuration HD4

CONCLUSIONS
In this contribution, a concurrent execution strategy has been proposed to improve the efficiency of DEM simulations on hybrid CPU-GPU architectures. The novel approach was developed with a focus on multi-core personal computers equipped with a general-purpose graphics processing unit. To analyze the efficiency of the new method, the calculation performance for various case studies, such as emptying a silo, mixing in a paddle mixer, and dynamic impact, has been analyzed.
The obtained results have shown that for most of the investigated scenes, the novel approach achieves a noticeable performance gain.
Due to the parallel co-execution, the average calculation time for the tested cases has been reduced by up to 37% compared to the sequential execution on hybrid CPU-GPU architectures. Furthermore, the new execution strategy reduces the energy consumption when simulating DEM scenes consisting of a large number of discrete objects. Although the performance gain strongly depends on the type of problem being solved, on the number of modeled objects, and on the hardware configuration, the effectiveness of the new approach has been demonstrated, together with a high potential for further research and development in this area. One possible improvement that can reduce the load imbalance between the CPU and the GPU is an automatically adjustable threshold distance used to populate the Verlet lists, calculated depending on the properties of each specific scene.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.