Characterization of data compression across CPU platforms and accelerators

The ever-increasing amount of generated data makes it more and more beneficial to utilize compression to trade computation time for reduced data movement and storage requirements. Lately, dedicated accelerators have been introduced to offload compression tasks from the main processor. However, research is lacking on the system costs of incorporating compression, especially regarding the influence of the CPU platform and of accelerators. This work shows that for general-purpose lossless compression algorithms the following can be recommended: (1) snappy for high throughput, but low compression ratio; (2) zstandard level 2 for moderate throughput and compression ratio; (3) xz level 5 for low throughput, but high compression ratio. It also shows that the selected platforms (ARM, IBM, and Intel) have no influence on the algorithms' performance. Furthermore, it shows that the accelerator's zlib implementation achieves a compression ratio comparable to zlib level 2 on a CPU, while delivering up to 17× the throughput and utilizing over 80% less CPU resources. This suggests that the overhead of offloading compression is limited but present. Overall, this work allows system designers to identify deployment opportunities for compression while considering integration constraints.


INTRODUCTION
The ever-increasing amount of generated data leads to a growing gap between processing speed and I/O. As a result, it becomes more and more beneficial to utilize data compression to trade computation time for reduced data movement and storage requirements. The wish to use compression raises the questions of which compression algorithm is optimal for a given task, how to integrate it into the system, and how it will influence the overall system design and performance.
Compression algorithms can be classified into lossless and lossy compression. Lossless compression is a one-to-one mapping between original data and compressed data, allowing the original data to be reconstructed exactly. Lossy compression, in contrast, is a many-to-one mapping and thus cannot guarantee exact reconstruction. This work only analyses lossless compression algorithms, as they are often the only choice in scientific research communities, such as the High Energy Physics (HEP) community, where accurate data is necessary for valid scientific findings. Hence, lossy compression and its potentially inaccurate reconstruction are undesirable. Aside from this classification, different compression algorithms offer different trade-offs between compression ratio and computation time. The system design on the hardware level includes the choice of the CPU platform and the usage of accelerators.
Even though compression becomes increasingly attractive, little research is available about the costs of integrating compression into existing systems and about the resulting system design choices. Therefore, this work characterizes the performance of software-based general-purpose lossless compression algorithms on three different CPU platforms: ARM Armv8 aarch64, IBM POWER8 ppc64le, and Intel Xeon x86_64. It also compares the performance of commercially available compression accelerators to the software-based general-purpose lossless compression algorithms on an Intel Xeon x86_64 server. With this, the following contributions are made:
• Presentation of benchmarks for single-threaded, multi-threaded, and accelerator-based compression
• Performance and power analysis of general-purpose lossless compression algorithms on different CPU platforms (ARM Armv8 aarch64, IBM POWER8 ppc64le, and Intel Xeon x86_64)
• Performance comparison of commercially available compression accelerators with the above-mentioned software-based compression algorithms
• Proposal for the selection of compression algorithms depending on problem requirements
These contributions allow, in particular, system designers to identify deployment opportunities given existing compute systems, data sources, and integration constraints.

RELATED WORKS
To diminish the implications of (network) data movement and storage, massive utilization of compression can be found, among others, in communities working with databases, graph computations, or computer graphics. For these communities, a variety of works exists that address compression optimizations by utilizing hardware-related features like SIMD and GPGPUs.[1][2][3][4] Usage of general-purpose compression algorithms can be found in, for example, mobile devices,5 video compression,6 or communication between robots.7 Such algorithms are also compared against newly developed algorithms, like Brotli.8 Furthermore, lossy and lossless compression algorithms were analyzed for data of cosmic particles on an Intel system.9 Similar to the HEP data sets used in this work, researching cosmological particles creates huge amounts of data under real-time constraints.
Moreover, a multitude of research has been conducted to efficiently use Field Programmable Gate Array (FPGA) technology, as FPGAs excel at the integer computations that dominate compression. Works here range from developing FPGA-specific compression algorithms,10 to integrating FPGA compression hardware into existing systems,11,12 to implementing well-known compression algorithms, like gzip, and comparing the achieved performance to other systems (CPU, GPGPU). For example, some work explored CPUs, FPGAs, and CPU-FPGA co-design for LZ77 acceleration,13 while other work analyzed the hardware acceleration capabilities of the IBM PowerEN processor for zlib.14 Regarding FPGAs, Abdelfattah et al.15 and Qiao et al.16 implemented their own deflate versions on an FPGA.
However, the large majority of those works focuses on compression performance in terms of throughput and compression ratio, but not on the resulting integration costs for the entire system. One of the few exceptions is Matai et al.,17 who analyzed the energy efficiency of canonical Huffman coding on Intel, ARM, and FPGA platforms, with Intel being the least and the FPGA the most energy-efficient.
Overall, little to no published work can be found covering the topic of system integration costs in the context of compression accelerators.

Compression algorithms
The selected compression algorithms were lossless general-purpose compression algorithms which do not rely on any pre-existing knowledge about the data. The selection was based on their usage in related works and general popularity, including recent developments. Most compression algorithms allow tuning the performance for compression ratio* or throughput† by setting the compression level. Generally speaking, a higher compression level yields a higher compression ratio, but also a longer computation time. For this study, compression algorithms that run solely on the CPU are called software-based (sw-based). Table 1 lists all selected sw-based algorithms and their respective compression levels. For them, a pre-study was done to select compression levels at which significant changes in either compression ratio or throughput occurred (see Figure 1). Note that the level in bzip2 does not refer to the compression function utilized, but to the block size of the input data (between 100 and 900 kB); for this study the largest block size (level 9) was chosen.
*compression ratio = (uncompressed data)/(compressed data). †throughput = (uncompressed data)/(run time).
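As a concrete illustration of these two metrics, the following sketch measures compression ratio and throughput with Python's bundled zlib; the sample input and the chosen levels are illustrative placeholders, not data or configurations from the benchmark.

```python
import time
import zlib

def benchmark(data: bytes, level: int) -> tuple[float, float]:
    """Compress `data` once; return (compression ratio, throughput in MB/s)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)     # uncompressed / compressed
    throughput = len(data) / elapsed / 1e6  # uncompressed bytes per second
    return ratio, throughput

# Highly repetitive sample input, so a high ratio is expected.
sample = b"sample event record " * 50_000
for level in (2, 4, 9):
    ratio, mbps = benchmark(sample, level)
    print(f"zlib level {level}: ratio {ratio:.1f}, {mbps:.0f} MB/s")
```

Running this makes the level trade-off visible directly: higher levels tend to raise the ratio and lower the throughput, mirroring the pre-study used to pick the levels in Table 1.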
Lz4 19 is an LZ77-based, byte-oriented algorithm. The sub-type Lz4_HC offers 13 levels to increase the compression ratio and automatically builds the frame (header and footer) for the compressed payload. For this study Lz4_HC level 2 was selected; it has a low compression ratio, but a high throughput. For the rest of this study Lz4_HC is referred to as lz4.
Snappy 20 is an algorithm based on LZ77. It was developed at Google with the goal to have a short computation time. It decompresses data significantly faster than it compresses data. Snappy has no tuning parameters.
Xz 21 is a popular tool which offers multiple compression algorithms. The primary algorithm used is LZMA2, which has 9 levels and can achieve a high compression ratio, but requires a long computation time. For this study level 1 and level 5 were selected.
Zlib 22 utilizes the deflate algorithm. 23 Deflate is an algorithm based on LZ77, followed by Huffman coding. Zlib is a popular library used, among others, for ZIP and gzip compression. Zlib offers 9 compression levels, of which levels 2 and 4 were selected. This allows a comparison against the zlib implementation of the accelerators used in this study.
Zstd 24 is an algorithm based on LZ77, in combination with fast Finite State Entropy and Huffman coding. It was developed by Facebook for real-time compression. Zstd offers 23 levels. Above compression level 20 the configuration differs to achieve a higher compression ratio, but this significantly increases memory usage. For this study level 2 and level 21 were selected.

Data sets
Depending on the part of this study, the input data was selected from up to four standard compression corpora and six HEP data sets, listed in Table 2. To be representative of big data, compression corpora of at least 200 MB were preferred. One exception was the calgary corpus, as it is one of the most famous compression corpora and thus used in many related works. Tarred into a single file, it only amounts to 3.1 MB.
To conform with the benchmark requirements, it was copied multiple times to reach 150 MB. The HEP data came from three different CERN experiments. Several files are listed as uncompressed, which is the preferred state in this study in order to analyze the data's full potential. However, this does not mean that the experiments store their production data like this; many of them apply some form of compression, for example, zero-compression, which is similar to the representation of sparse matrices in coordinate format. At CERN, particles are accelerated in a vacuum very close to the speed of light. They are then collided and analyzed by the experiments. The data created consists of the spatial locations where particles were registered. Due to the nature of quantum mechanics this process is close to a random number generator. As compression takes advantage of patterns within data, a high compression ratio cannot be expected for these data sets.
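Such replication matters for algorithms with large search windows: once the window spans the whole original corpus, every repeated copy can be encoded as a back-reference, inflating the measured ratio. A minimal sketch with Python's lzma (block size, repeat count, and preset are arbitrary choices for illustration):

```python
import lzma
import os

# An incompressible 64 KiB block of random bytes, tiled to imitate the
# replicated calgary tar: the content is random, but perfectly periodic.
block = os.urandom(64 * 1024)
data = block * 64  # 4 MiB total

compressed = lzma.compress(data, preset=5)
ratio = len(data) / len(compressed)
print(f"ratio: {ratio:.0f}")  # far above 1, despite the random content
```

This is why ratios measured on replicated corpora must be interpreted with care when comparing algorithms with different window sizes.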

Compression benchmarks
A benchmark for each scenario was created: one to analyze the performance of the accelerators, and one each to analyze the single-stream and multi-stream behavior of the sw-based compression. All benchmarks were built upon the same principles to reduce any benchmark-related bias.
The layout of both multi-stream benchmarks, sw-based and accelerator, is shown in Figure 2.
The performance measured was the sustained throughput over the compression function itself: retrieving the data from RAM, compressing it, and writing it back to RAM. The tasks of initially loading the data into RAM and writing it out to, for example, disk were excluded.
The following metrics are measured: average compression ratio, total size of uncompressed data, total size of compressed data and the exact run time of each compression thread.

Software-based benchmarks
Both the single-stream and the multi-stream benchmark first loaded the entire data set into RAM before calling the streaming compression function of the compression library.
The single-stream benchmark consists of a single process which executes the compression and runs exclusively on the server. It was used to evaluate and select the compression levels of each algorithm for this study, and to evaluate the stability of throughput and compression ratio when reducing the input data size. Based on this, and to account for memory limitations on the different servers when using the multi-stream benchmark, the chunk size was set to 150 MB. Larger data sets are iterated over multiple times in a round-robin fashion, each time compressing 150 MB blocks until the time constraint is reached.
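The round-robin chunking described above can be sketched as follows; zlib stands in for the algorithm under test, and the chunk size is a parameter so the sketch can be scaled below the 150 MB used on the servers:

```python
import time
import zlib

def compress_round_robin(data: bytes, duration_s: float,
                         chunk_size: int = 150 * 1024 * 1024,
                         level: int = 2) -> float:
    """Compress `data` in fixed-size blocks, cycling round-robin over the
    input until the time constraint is reached; returns the average ratio."""
    if not data:
        raise ValueError("empty input")
    n_in = n_out = offset = 0
    deadline = time.perf_counter() + duration_s
    while True:
        chunk = data[offset:offset + chunk_size]
        if not chunk:          # end of data set reached: wrap around
            offset = 0
            continue
        n_in += len(chunk)
        n_out += len(zlib.compress(chunk, level))
        offset += chunk_size
        if time.perf_counter() >= deadline:
            return n_in / n_out

# Scaled-down usage: 8 KiB chunks for 0.2 s on a synthetic input.
print(compress_round_robin(b"sample event " * 4096, 0.2, chunk_size=8192))
```

The deadline is checked after each block so that at least one full chunk is always compressed, matching the benchmark's block-granular accounting.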
FIGURE 2 Benchmark layout: The mainThread coordinates the progress of the compressThreads. Each compressThread continuously compresses the input and records the results. For the accelerator benchmark the additional queueThread (in blue) provides the input data chunk-wise to the compressThreads.
The multi-stream benchmark consists of two types of threads: the mainThread, which is the thread started by the user and which coordinates the orchestration of all other threads, and multiple compressThreads that execute the compression. Hwloc is used to bind the compressThreads, equally split, onto the NUMA nodes. Due to memory limitations the input data set was loaded only once per NUMA node and shared with the node's compressThreads. All other resources were uniquely assigned and private to each compressThread (e.g., the output buffer).
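The thread layout can be sketched in Python (hwloc binding and NUMA placement are omitted; CPython's zlib releases the GIL during compression, so the compressThreads genuinely overlap). The payload and thread count are placeholders:

```python
import threading
import time
import zlib

def compress_thread(shared_input: bytes, results: list, idx: int) -> None:
    """One compressThread: reads the shared input buffer and keeps its
    output buffer and timing private, as in the benchmark layout."""
    start = time.perf_counter()
    out = zlib.compress(shared_input, 2)
    results[idx] = (len(shared_input), len(out), time.perf_counter() - start)

# mainThread: load the input once (per NUMA node in the real benchmark),
# start the compressThreads, and collect their private results.
shared = b"sample detector payload " * 20_000
results: list = [None] * 4
threads = [threading.Thread(target=compress_thread, args=(shared, results, i))
           for i in range(len(results))]
for t in threads:
    t.start()
for t in threads:
    t.join()
total_in = sum(r[0] for r in results)
total_out = sum(r[1] for r in results)
print(f"aggregate compression ratio: {total_in / total_out:.2f}")
```

Sharing one read-only input buffer while keeping outputs private is the key memory-saving design of the multi-stream benchmark.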

Accelerator benchmark
The accelerator benchmark is a multi-stream benchmark which, compared to the sw-based benchmark, has one additional thread type: the queueThread. This overlap between the benchmarks can also be seen in Figure 2. The queueThread provides the data set chunk-wise to the compressThreads via a thread-safe queue. This is necessary as each compression stream needs access to its own allocated input data in order to communicate with the accelerator. To optimize the load for each thread, a compressThread is able to run multiple compression streams in a single thread. The compression function used busy waiting to retrieve the results of the streams from the accelerator. Parameters that could be modified were the number of each type of thread, the input buffer size, the size of the thread-safe queue, the number of compression streams within each compressThread, and the type of algorithm and its compression level.
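A scaled-down sketch of the queueThread/compressThread interplay, with zlib standing in for the accelerator streams and a sentinel marking the end of the input (queue and chunk sizes are shrunk from the 500-element, 2 MB configuration used in the study):

```python
import queue
import threading
import zlib

CHUNK = 64 * 1024  # scaled down from the 2 MB accelerator chunks
work: queue.Queue = queue.Queue(maxsize=500)

def queue_thread(data: bytes) -> None:
    """Slice the data set chunk-wise into the thread-safe queue; each chunk
    is a private copy, mirroring the per-stream input buffers required."""
    for off in range(0, len(data), CHUNK):
        work.put(data[off:off + CHUNK])
    work.put(None)  # sentinel: no more input

def compress_thread(sizes: list) -> None:
    """Drain chunks until the sentinel arrives, then pass it on so the
    remaining workers terminate as well."""
    while (chunk := work.get()) is not None:
        sizes.append((len(chunk), len(zlib.compress(chunk, 2))))
    work.put(None)

data = b"sample event payload " * 20_000
sizes: list = []
workers = [threading.Thread(target=compress_thread, args=(sizes,))
           for _ in range(4)]
feeder = threading.Thread(target=queue_thread, args=(data,))
for t in (*workers, feeder):
    t.start()
for t in (*workers, feeder):
    t.join()
total_in = sum(i for i, _ in sizes)
print(f"chunks: {len(sizes)}, ratio: {total_in / sum(o for _, o in sizes):.2f}")
```

The bounded queue provides natural backpressure: the feeder blocks once 500 chunks are outstanding, which keeps the memory footprint predictable regardless of data set size.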
Overall, the main difference to the sw-based benchmark was that each compression stream needed its own allocated memory for the input data, while the sw-based benchmark moved a pointer on the shared memory of the input data.

Performance instrumentation and measurement
Multiple tools were used to measure the CPU and RAM power consumption and the memory bandwidth. The CPU and RAM power consumption was measured for ARM by using ipmitool remotely to access the baseboard management controller (BMC), for IBM by using ipmitool locally, and for Intel with the Processor Counter Monitor or likwid.28 Furthermore, for Intel, likwid was used to measure the read, write, and total memory bandwidth.

Servers
Three servers were selected, each belonging to a different platform: x86_64 (Intel), ARMv8 aarch64 (ARM), and POWER8 ppc64le (IBM). For Intel, an older mid-tier server, the E5-2650 v3, was used. For IBM, a similarly old but high-tier server, the 8335-GTB, was used. Only for ARM was a state-of-the-art high-tier server used, the Cavium ThunderX2. The hardware specifications of the servers are listed in Table 3.

Compression accelerators
Two commercially available accelerators were considered: Intel QuickAssist 8970 (QAT), 29 and AHA378. 30 QAT is an ASIC which offers encryption and compression acceleration up to 100 Gb/s. AHA378 is an FPGA accelerator with up to 80 Gb/s, solely designed for compression.
Unfortunately, the QAT exhibited instabilities in performance and thus did not deliver any consistent results. Therefore, this work will focus on the AHA378 accelerator.

PERFORMANCE OF DIFFERENT CPU PLATFORMS
This section analyses the characteristics of the different sw-based compression algorithms on the three common CPU platforms: ARM (aarch64), IBM (ppc64le), and Intel (x86_64). The following behaviors were characterized: cross-platform behavior, performance per core, scalability, robustness, and power consumption. In this section only the HEP data sets were used.

Cross-platform behavior
For the cross-platform behavior the best performance of the multi-stream benchmark was compared across all servers. The best performance might differ for each compression algorithm and level in the number of compression streams executed in parallel. Figure 3 shows that, independent of the platform, all servers produced a similar performance pattern for the same data set, ALICE 2. Figure 4 shows that the performance pattern differed between data sets; for example, bzip2 level 9 achieved both a higher throughput and a higher compression ratio for ALICE 2 than for LHCb 1.

Performance per core
To compare the performance between different platforms, a metric was designed to describe the performance per core: throughput / (CPU clock speed × #real cores). Values for the CPU clock speed were taken from the manufacturer specifications, as the CPU clock speed was not measured during the benchmarks.
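In code, the metric and the derived relative increase look like this; the numeric inputs below are placeholders for illustration, not the specification values from Table 3:

```python
def performance_per_core(throughput: float, clock_ghz: float,
                         real_cores: int) -> float:
    """Per-core performance: throughput / (CPU clock speed * #real cores)."""
    return throughput / (clock_ghz * real_cores)

# Placeholder figures only (Gb/s, GHz, real cores).
intel = performance_per_core(2.0, 2.3, 10)
ibm = performance_per_core(4.5, 2.9, 10)
print(f"relative increase over Intel: {ibm / intel:.2f}x")
```

Normalizing by nominal clock and real core count isolates architectural efficiency from core count and frequency differences, at the cost of ignoring Turbo behavior.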
In Figure 5, two plots, one for ATLAS 2 and one for LHCb 1, highlight the performance increase per core compared to the weakest server, the Intel E5-2650 v3. The relative performance increase differed depending on the compression algorithm and level. The Intel E5-2650 v3 had the weakest performance per core in all cases, followed by the ARM ThunderX2, with the IBM 8335-GTB performing best. The only exception was zstd level 2, which, depending on the data set, performed equal to or worse on the ARM ThunderX2 than on the Intel E5-2650 v3. Furthermore, the IBM 8335-GTB had a notably higher performance increase for bzip2 level 9, xz level 1, and xz level 5 compared to the ARM ThunderX2.

Scalability
To evaluate the performance scaling both sw-based benchmarks were run with a chunk size of 150 MB. The best performance of the multi-stream benchmark was selected per compression algorithm and level. This might result in different thread configurations for the same server. Table 4 lists the performance increase normalized by the number of real cores between the multi-stream and the single-stream benchmark.
A value of 1.0 implies that multi-stream and single-stream have the same performance per real core, and thus that there is no performance gain from using SMT.

Robustness
In this part of the study, the number of parallel compression streams which achieved maximum throughput was determined. Figure 6 shows a 2D heat map for the evaluation on each platform depending on compression algorithm (left column) and input data set (right column). For every combination of algorithm and data set, each thread configuration that was within 5% of the maximum performance was counted. Brighter colors on the heat map represent a smaller variance in the thread configurations for optimal performance and thus describe a more robust platform.
The most robust environment was provided by the Intel E5-2650 v3. It preferred to use all virtual cores (SMT2). The next most robust server was the ARM ThunderX2, which also preferred to use all virtual cores (SMT4), except for snappy, which performed better using only SMT3. The least robust server was the IBM 8335-GTB. It had contradicting configuration preferences depending on which kind of robustness was evaluated: for the robustness based on the algorithms it preferred SMT6 (96 cores), with the exception of zstd level 21, which preferred SMT7 (112 cores); for the robustness based on the data sets it preferred SMT4 (64 cores). Neither of the robustness studies preferred using all virtual cores (SMT8, 128 cores).

Power consumption
The CPU and RAM power consumption was evaluated on the multi-stream benchmark. All the values refer to the power consumption of a single socket. It was not possible to measure the RAM power consumption on the ARM ThunderX2, thus the cross-platform comparison will only be based on the CPU power consumption.
The overall maximum CPU and RAM power consumption for each server is listed in Table 5. Both the IBM 8335-GTB and the Intel E5-2650 v3 reached their highest power consumption on the data set ALICE 3, at 99% and 92% of the TDP, respectively, while the ARM ThunderX2 only reached about 77% of its TDP, with its highest power consumption on the data set LHCb 1. Grading the algorithms by throughput and energy efficiency, the following order can be established (first being best): snappy > lz4 lvl 2 > zstd lvl 2 > zlib lvl 2 > bzip2 lvl 9 > xz lvl 1 > xz lvl 5 > zstd lvl 21. The scaling of the energy efficiency is shown in Figure 7 (energy efficiency for LHCb 1): the ARM ThunderX2 has up to a factor 5 better performance per Joule consumed than the Intel E5-2650 v3, which performed slightly better than the IBM 8335-GTB. Overall, the biggest CPU power consumer was lz4 level 2 for the Intel E5-2650 v3 and the ARM ThunderX2, and xz level 1 and xz level 5 for the IBM 8335-GTB. xz on the IBM 8335-GTB also had the largest increase in RAM power consumption, while at the same time having the best CPU energy efficiency increase. For the other platforms no such connection was visible. The lowest increase in power consumption was achieved on the Intel E5-2650 v3: zstd level 21 and zlib level 2 for the CPU, and zlib level 2 and lz4 level 2 for the RAM.

PERFORMANCE OF THE ACCELERATOR
This section analyzes the characteristics of compression accelerator AHA378 in comparison to sw-based compression algorithms on the Intel E5-2650 v3. The data sets used were all compression corpora and the HEP data sets ALICE 2, ATLAS 1 and LHCb 1.

Accelerator results
The accelerator achieved a sustained throughput of 75-81 Gb/s, which is at least 94% of the advertised throughput of 80 Gb/s; for the calgary corpus the advertised throughput was even surpassed. The configuration achieving this used numactl to bind the entire benchmark to the NUMA node to which the accelerator was connected. The input parameters were set to a chunk size of 2 MB, 500 elements in the queue, 3 queueThreads, 4 compressThreads, and 12 compression streams per compressThread. All results are listed in Table 6 for throughput and compression ratio, in Table 7 for memory bandwidth, and in Table 8 for power consumption.
TABLE 6 Compression ratio and throughput for the accelerator and sw-based algorithms. Note: For the sw-based algorithms only the best and worst performing algorithms are listed. Additionally, zlib level 2 is listed to allow a comparison with the accelerator.
The lowest throughput and compression ratio were achieved on LHCb 1, with 75 Gb/s and a compression ratio of 1.18. The highest compression ratio was achieved on the Silesia corpus, with a ratio of 2.94 and a throughput of 79 Gb/s. Overall, the data-dependent change of the throughput was less than 4%.
The memory bandwidth was between 22 and 30 GiB/s. ALICE, ATLAS, and LHCb had the highest memory bandwidth, and calgary corpus the lowest memory bandwidth. Exactly the opposite correlation could be found when looking at the memory bandwidth percentage used for reads.
Between 51% and 63% of all memory accesses were reads. In general, there was a correlation between a high compression ratio and a high percentage of reads: a higher compression ratio means a smaller output to be written, and as a result the percentage of reads increases relative to the writes. Only the calgary corpus defied this in the final measurement taken, having the lowest read percentage (51%) while at the same time the second highest compression ratio. However, multiple measurements showed that the read percentage of the calgary corpus averaged 59%, significantly higher.
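This correlation follows from a simple traffic model: if every uncompressed byte (U) is read once and every compressed byte (U/r) written once, the read share is U/(U + U/r) = r/(r + 1). The following sketch applies this idealized model (which ignores any additional buffer traffic) to the two ratios reported above:

```python
def ideal_read_fraction(ratio: float) -> float:
    """Read share of memory traffic if the input (U bytes) is read once and
    the output (U/ratio bytes) is written once: U/(U + U/ratio) = r/(r+1)."""
    return ratio / (ratio + 1)

# Measured compression ratios from Table 6: LHCb 1 and Silesia corpus.
for name, r in (("LHCb 1", 1.18), ("Silesia", 2.94)):
    print(f"{name}: ideal read share {ideal_read_fraction(r):.0%}")
```

That the measured read shares (51%-63%) sit below the model's prediction for the high-ratio corpora would be consistent with additional memory traffic beyond the compressed output itself.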
Similar to the throughput, the power consumption was stable and negligibly influenced by the data sets. For the socket to which the accelerator was connected, the CPU consumed 58 W and the RAM consumed 7 W. While running the benchmark, the power consumption of the entire server was 266 W.

Result comparison
In this section the results of the accelerator are compared to the results of the sw-based benchmark. The sw-based benchmark was run with 40 compressThreads, using all 40 virtual cores of the server.
The throughput and compression ratio results are listed in Table 6. The accelerator was compared against the sw-based zlib level 2. Even though zlib level 4 was closer in compression ratio (7% better), the penalty in throughput was considered too large (13%-37% worse). Depending on the data set, the accelerator achieved 8 to 17 times the throughput of zlib level 2 utilizing all virtual cores. A different algorithm, zstd level 2, had a similar compression ratio to zlib level 2, but a 2 to 3 times higher throughput.
snappy and lz4 achieved the highest throughput: for the data set PROTEINS they matched the throughput of the accelerator. However, their throughput was very data-dependent overall, and their compression ratio was only 65%-82% of that of zlib level 2. The lowest throughput was delivered by zstd level 21 and, in a few cases, xz level 5, but these were also the algorithms with the highest compression ratio. For the 150 MB version of the calgary corpus, both zstd level 21 and xz level 5 were able to recognize the entire calgary corpus sequence as a recurring pattern, and thus achieved a compression ratio of over 170.
The memory bandwidth, listed in Table 7, was between 1 and 40 GiB/s for the sw-based algorithms. zlib level 2 had the lowest bandwidth in all cases, with 1-2 GiB/s, while xz level 5 had the highest bandwidth for the majority of data sets; otherwise it was zstd level 21.
The percentage of memory bandwidth attributable to read access was between 49% and 77%. Here, zlib level 2 had a similar split to the accelerator.
Listed in Table 8, the CPU power consumption of the sw-based algorithms was between 72 and 103 W for each socket, with snappy and zstd level 21 having the lowest CPU power consumption and zstd level 2 and zlib level 2 having the highest CPU power consumption.
zlib level 2 used between 98 and 101 W, which is 30 W more CPU power consumption than the accelerator. The RAM power consumption of the sw-based algorithms was between 4 and 11 W; zlib level 2 had the lowest and xz level 5 the highest. There was a direct correlation between RAM power consumption and the measured memory bandwidth for the sw-based algorithms.
However, at the same RAM power consumption of 7 W, the accelerator benchmark achieved twice the memory bandwidth, a value the sw-based algorithms only reached when consuming 9 W. The server's power consumption of 266 W for the accelerator was within the range also achieved by the different sw-based algorithms (259-301 W).

DISCUSSION
This work consists of two sections: the first characterized the performance of general-purpose lossless compression algorithms on the three common CPU platforms, ARM (aarch64), IBM (ppc64le), and Intel (x86_64); the second characterized the performance of compression accelerators, comparing them to sw-based compression algorithms on the Intel (x86_64) platform.

Performance of different CPU platforms
It was observed that substantial compression ratios were achieved in spite of inhomogeneous data and already applied compression methods. Furthermore, all three platforms produced a similar performance pattern for the same data set. This can be explained by the compression libraries not being performance-optimized for any particular platform, but rather for compression accuracy and reliability.
For nearly all cases, the best performance was achieved by utilizing all virtual cores. Only the IBM 8335-GTB was inconsistent with this behavior and had problems finding a robust configuration. Looking at the performance per core, the IBM 8335-GTB outperformed both the ARM ThunderX2 and the Intel E5-2650 v3. More interesting, however, is that for more than half of the algorithms the relative performance increase of the IBM 8335-GTB over the Intel E5-2650 v3 was not even close to a factor of two, even though the IBM 8335-GTB was newer and had close to twice the CPU power consumption while having fewer real cores. This might indicate that most compression libraries do not use the full potential of the IBM 8335-GTB.
Looking at the scalability, the Intel E5-2650 v3 scaled slightly negatively, with a factor of 0.97. This might indicate that the clock-frequency boost provided by Turbo for a single core slightly outweighs the benefit of using SMT2 to reduce pipeline stalls. The ARM ThunderX2 reached a scaling factor of 1.14, while the IBM 8335-GTB reached 1.87; its gain over the single-stream baseline is more than four times that of the ARM ThunderX2, suggesting that the POWER8 architecture uses its SMT modes more efficiently than the ARMv8.

Performance of the accelerator
The AHA378 accelerator achieved up to 17 times the throughput of the sw-based equivalent of its compression algorithm (zlib level 2) while utilizing over 80% less CPU resources. The throughput achieved by the accelerator was at least 94% of the advertised throughput. The accelerator's throughput was stable independent of the data set (<4% variation), while the sw-based zlib level 2 suffered a throughput reduction of up to 55% depending on the data set. This suggests that the logic on the accelerator could support a higher throughput, but that some (bandwidth) bottleneck prevents this. During the study it was shown that NUMA binding is important to maintain stable and good performance.
Without NUMA binding, the throughput for the same input data could drop by up to 10 Gb/s over the entire benchmark.
This work can also be compared with the FPGA zlib implementations of Abdelfattah et al.15 and Qiao et al.16
The pipeline setup of the benchmarks is a realistic model of the planned data acquisition pipeline for the LHCb experiment at CERN. For each captured particle collision the data will be filtered for interesting events, compressed in-flight, and sent to storage. The compression will be done by directly streaming the data to the accelerator after the data has been filtered. Thus, the data will at all times reside in the server's main memory. Only after the compression will the data be sent to permanent storage, for example, HDDs.

Implications
In general, there is no surprise that the accelerator is faster than the sw-based algorithms while using significantly less resources. However, from the perspective of the system design it is important to understand the underlying requirements to select the optimal setup when it comes to platform and accelerator integration.
It was shown that when it comes to the sw-based compression algorithms there is no preference of any of the platforms: ARM, IBM or Intel.
For the accelerator, it was shown that the maximum throughput can be achieved by utilizing only 10% (= 4 compressThreads) of the CPU resources if the data already resides in memory. However, while the compute requirements were low, the memory bandwidth requirements of the accelerator were notable: it used around 80% of the memory bandwidth of one socket, including the allocation done in the queueThreads.
Hence, compute-intensive tasks would be the most compatible collocation on a system which uses accelerators for compression, as this would allow utilizing the 90% of unused CPU time. To improve and maintain stable performance, NUMA binding should be used. Possibly the best choice would be a task which utilizes the compression as a post-processing step, making sure that the data is already available in memory. Alternatively, two accelerators could be put into one server, which would then saturate the memory bandwidth of both sockets.
As the accelerator benchmark used busy-waiting, it is strongly believed that the CPU overhead can still be reduced further. Future work will focus on implementing a more complex communication model based on polling to reduce the CPU overhead, while also integrating other tasks on the same server.
Based on the information about time and space implications of the compression algorithms and levels provided by this study, it would be possible to create a scheduler that can select appropriate compression algorithms during run time while still adhering to real-time constraints of for example, Data Acquisition Systems.
Moreover, Figure 1 can be used as a reference to help with the selection of a fitting compression algorithm. The following general suggestions can be made: (1) use snappy for high throughput, but low compression ratio; (2) use zstd level 2 for moderate throughput and compression ratio; (3) use xz level 5 for low throughput, but high compression ratio.
However, as compression is highly data-dependent, the authors recommend conducting similar research before basing expensive hardware and software decisions on these results.

CONCLUSION
This work provides insights on the performance characteristics and system integration costs of general-purpose lossless compression algorithms on different CPU platforms and accelerators. For this, the performance on different CPU platforms and the performance of accelerators in comparison to software-based compression algorithms were evaluated. It is shown that the performance of the compression libraries is independent of the evaluated architectures: ARM Armv8 aarch64, IBM POWER8 ppc64le, and Intel Xeon x86_64. The ARM ThunderX2 had the best and most energy-efficient performance, but was also the most recently released server, while the IBM 8335-GTB had the best performance per core. Compared to the software-based zlib, the accelerator achieved an equal compression ratio, but with up to 17 times the throughput while utilizing over 80% less CPU resources. Considering that compression is often only one element of a data processing pipeline, this overhead cannot be neglected. Moreover, the results support that commercially available compression accelerators are a feasible choice to harvest the FPGA advantage of providing high throughput for well-known, reliable compression algorithms without intensive FPGA development effort. For software-based compression algorithms, the following can be suggested: (1) snappy for high throughput, but low compression ratio; (2) zstd level 2 for moderate throughput and compression ratio; and (3) xz level 5 for low throughput, but high compression ratio.

ACKNOWLEDGMENT
Open Access Funding provided by European Organization for Nuclear Research.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.