The impact of columnar file formats on SQL-on-Hadoop engine performance: A study on ORC and Parquet

Columnar file formats provide an efficient way to store data for querying by SQL-on-Hadoop engines. Related works consider the performance of the processing engine and the file format together, which makes it impossible to assess their individual impact. In this work, we propose an alternative approach: by keeping the processing engine fixed, we compare the different file formats as well as their different parameter settings. We apply our strategy to two processing engines, Hive and SparkSQL, and evaluate the performance of two columnar file formats, ORC and Parquet. We use BigBench (TPCx-BB), a standardized application-level benchmark for Big Data scenarios. Our experiments confirm that the file format selection and its configuration significantly affect the overall performance. We show that ORC generally performs better on Hive, whereas Parquet achieves the best performance with SparkSQL. Using ZLIB compression brings up to 60.2% improvement with ORC, while Parquet achieves up to 7% improvement with Snappy. Exceptions are the queries involving text processing, which do not benefit from using any compression.

ORC is the preferred file format for engines such as Hive and Presto, 11 whereas Parquet is the first choice for SparkSQL 7 and Impala. 8 A number of studies [19][20][21][22] have investigated and compared the performance of file formats by running them on different SQL-on-Hadoop engines. However, because of the different internal engine architectures, these works actually compare the engine together with its file format optimizations. Contrary to those approaches, in this paper, we compare the ORC and Parquet file formats while keeping the processing engine fixed. Our main goal is not to tell which engine is better but to understand how the overall performance of an engine is influenced by a change in the file format type or by a different parameter configuration. Figure 1 shows a graphical representation of our research objective.
This study investigates the performance of the ORC and Parquet file formats first in Hive and then in Spark SQL. It also shows how tuning the file format configuration accordingly can influence the overall performance. We perform a series of experiments using the standard BigBench (TPCx-BB) benchmark 23 with a dataset size of 1000 GB, comparing different ORC and Parquet configurations.
The contributions of this paper are as follows:
• performance evaluation of the ORC and Parquet file formats with their default configurations on both the Hive and Spark SQL engines;
• performance comparison of the ORC and Parquet file formats with two optimized configurations (with and without data compression, respectively) in Hive and Spark SQL;
• investigation of the influence of data compression (Snappy) on file format performance;
• detailed analysis of representative BigBench queries.
The rest of this paper is organized as follows. Section 2 gives background information and related work. Section 3 presents the experimental setup and the preparation stages. Section 4 discusses benchmark results for Hive, and Section 5 discusses benchmark results for SparkSQL.
Section 6 presents an in-depth analysis of a small subset of representative queries. Section 7 discusses our findings and directions for future work. Finally, Section 8 summarizes our results.

BACKGROUND AND RELATED WORK
In this section, we briefly introduce the main technologies and terms used throughout the paper. Moreover, we present a summary of the most relevant studies investigating data file formats in SQL-on-Hadoop systems.

Hive
Apache Hive 1,2 is a data warehouse infrastructure built on top of Hadoop. Hive was originally developed by Facebook and supports the analysis of large data sets stored on HDFS through queries in an SQL-like declarative query language called HiveQL, which does not strictly follow the SQL-92 standard. Additionally, HiveQL can natively call User Defined Functions (UDFs), allowing data to be filtered by custom Java or Python scripts. Plugging custom scripts into HiveQL makes the implementation of natively unsupported statements possible, as illustrated below. Hive consists of two core components: the driver and the Metastore. The driver is responsible for accepting HiveQL statements, submitted through the command-line interface (CLI) or the HiveServer, and translating them into jobs that are submitted to the MapReduce engine. 2 This allows users to analyze large data sets without actually having to develop MapReduce programs themselves. The Metastore is the central repository for Hive's metadata and stores all information about the available databases, tables, table columns, column data types, and more. The Metastore typically uses a traditional RDBMS like MySQL to persist the metadata.
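As a minimal sketch of this mechanism (with hypothetical script and table names, not taken from the benchmark), a custom Python script can be plugged into a HiveQL query via the TRANSFORM clause:

-- Ship the script to the cluster and stream rows through it.
ADD FILE sessionize.py;
SELECT TRANSFORM (user_id, click_time)  -- input columns piped to the script's stdin
USING 'python sessionize.py'            -- custom per-row logic in Python
AS (user_id, session_id)                -- output columns parsed back from stdout
FROM web_clicks;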

Spark SQL
Apache Spark 30 is a cluster computing system that is able to run batch and streaming analysis jobs on data distributed across the cluster. Spark SQL 7 is one of the many high-level tools running on top of a Spark cluster. It facilitates the processing of structured data by offering an SQL-like interface and support for HiveQL and Hive UDFs. To achieve this, it defines the concept of a DataFrame, a collection of structured records, and a declarative API to manipulate it. Spark SQL includes a specific query optimizer, Catalyst, which improves computation performance thanks to the available information about the data structure. Data can be queried from multiple sources, including the Hive catalog.

Columnar file formats
Apache ORC 3,31 and Apache Parquet 5 are the most popular and widely used file formats for Big Data analytics, and they share many common concepts in their internal design and structure. In this section, we present the main aspects of columnar file formats in general and their purpose in optimizing query execution. Figure 2 supports the explanation.
The upper right part of Figure 2 shows a table example with four columns (A, B, C, and D). Each column has a different color to better understand the column-oriented storage pattern later. A table is stored in one or more HDFS files, composed of one or more file system blocks.
Table rows are partitioned into groups and stored in the blocks, as shown in the left upper part of Figure 2. Row groups are data units independent from each other and used to split the input data for parallel processing. Data in row groups is not stored row by row, but column by column.
Looking at the first row group detail in Figure 2, we can see that data values for Column A (in red) are stored first, followed by data values for Column B (in purple), and so on. These portions of column data are usually called column chunks, and they allow a query to skip unneeded columns while reading the file. Finally, data in column chunks is split into pages, which are the indivisible units for encoding and compression.
Because values in a column chunk share the same data type, encoding algorithms can achieve a more efficient representation. The encoded output is then compressed with a generic compression algorithm (like ZLIB 32 or Snappy 33 ).
While the column-oriented storage strategy is used to filter out unnecessary column data for a query, indexes are used to filter out row groups.
They are usually placed in front of each row group so that, by just reading the index, an entire row group can be immediately skipped. This is only possible when the query execution engine uses predicate push-down optimization. 34 Indexes are not shown in Figure 2 for the sake of simplicity. Tables 2 and 3 report all the mentioned general concepts and their specific implementation name for each file format.
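As a concrete illustration of how these two mechanisms cooperate (a hypothetical table and query, not taken from the benchmark), consider a query that touches only some columns and filters on one of them:

-- Column pruning: only the chunks for columns B and D are read from
-- each row group; chunks for A and C are skipped entirely.
-- Predicate push-down: row groups whose index reports max(D) <= 100
-- are skipped without reading their data at all.
SELECT B, D
FROM example_table
WHERE D > 100;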

ORC
The index contains row statistics and the minimum and maximum value for each column stream. An index can span multiple pages, and a page contains information about 10 000 rows by default. In addition, a bloom filter is included for better row filtering. Finally, a top-level index is placed in the file footer.
ORC supports the complete set of data types available in Hive, including the complex types: structs, lists, maps, and unions. Numerical columns are encoded using Run-Length Encoding (RLE), and it is possible to select between the SPEED (default) and COMPRESSION encoding strategies. Dictionary encoding is applied to strings when possible. Encoding makes the data representation more lightweight; it is independent of the generic compression algorithm, which is applied to the encoding output. The default compression algorithm is ZLIB. 32 The metadata in ORC is stored at the end of the file (after the file footer) using Protocol Buffers, 36 providing the ability to add new fields to the table schema without breaking readers.

Parquet
Apache Parquet 5 is an open source columnar storage format using complex nested data structures, inspired by the Google Dremel paper. 6 It is a general-purpose storage format that can be used with or integrated into any data processing framework or engine. Parquet supports efficient compression and encoding schemes on a per-column level. It uses Apache Thrift 37 for the metadata definitions. There are three types of metadata: file metadata, column (chunk) metadata, and page header metadata.
One of the comparative studies mentioned above compared SparkSQL with text data and Parquet; Hive-on-MR and Hive-on-Tez with ORC; and AsterixDB with normalized data and nested data.
The results showed that using optimized columnar file formats such as ORC and Parquet significantly improved the performance.

Optimizations.
Recently, many new techniques have been proposed for optimizing the performance and efficiency of analytical workloads on top of columnar storage formats. Bian et al 44 focused on improving the data scan throughput by finding an optimal column layout. By applying efficient column ordering, they reduced the end-to-end query execution time by up to 70%, and by an additional 12% when using column duplication with less than 5% extra storage.
A different approach to reducing data access, based on data skipping, was presented by Sun et al. 45,46 The authors introduced a four-step framework (workload analysis, augmentation, reduce, and partitioning phases) for data skipping by applying a more effective partitioning scheme that takes the query filters into account. The results showed 3-7x improvements in query response time compared to traditional range partitioning. In their latest work, Sun et al 47 presented a novel hybrid data skipping framework that optimizes the overall query performance by automatically balancing skipping effectiveness and tuple-reconstruction overhead. It allows both horizontal and vertical partitioning of the data, which maximizes the overall query performance. However, none of the above studies investigates the similarities and differences of the two formats using a common SQL-on-Hadoop engine as a baseline. Additionally, most of the above comparisons were based on benchmarks using structured data, such as TPC-H 42 and TPC-DS. 43 In our study, we use BigBench, which operates on structured, semi-structured, and unstructured data and has been standardized as TPCx-BB 50 by the TPC committee.

EXPERIMENTAL SETUP
This section describes the hardware and software components and the different configurations used in our experiments.

Hardware configuration
The experiments were performed on a cluster consisting of 4 nodes connected directly through a 1 Gbit Netgear switch. All 4 nodes are Dell PowerEdge T420 servers. The master node is equipped with 2x Intel Xeon E5-2420 (1.9 GHz) CPUs, each with 6 cores, and 32 GB of main memory. Table 5 shows relevant cluster parameters and how they were adjusted for the experiments.

BigBench
In order to perform an extensive evaluation of the file formats, it was necessary to use a Big Data benchmark that is able to evaluate the 3Vs characteristics and utilizes structured, semi-structured, and unstructured data. BigBench 23,51 has been proposed and developed to address exactly these needs. It is an end-to-end, application-level, and technology-agnostic Big Data analytics benchmark that was recently adopted by the TPC and released as the TPCx-BB 50 benchmark. Chowdhury et al 52 presented a BigBench implementation for the Hadoop ecosystem, which is available on GitHub 53 and was used for our experiments. The data set models a fictitious product retailer business, and the workload consists of 30 complex queries. Ten queries were taken from the TPC-DS benchmark, 43 whereas the remaining 20 queries were adapted from the McKinsey report. 54 Table 6 summarizes the number and type of queries of this BigBench implementation. The query grouping is an extended version of the one proposed in the BigBench specification (see appendix B.3.2 in ''TPC - Current Specifications - tpcx-bb v1.2.0'' 55 ). All queries are implemented using Apache Hadoop, 56 Hive, 57 Mahout, 58 and the open Natural Language Processing toolkit. 59 Apache Mahout 58 is a library for quickly creating scalable, high-performance machine learning applications on top of MapReduce, Spark, and similar frameworks.
Apache OpenNLP 59 is a machine learning toolkit for processing natural language text.

File format configurations
One of the goals of this study is to investigate how changing the configuration parameters of ORC and Parquet influences their performance.
Therefore, defining the exact configuration parameter values was an essential first step before starting the experiments.
As shown in Sections 2.3, 2.4, and 2.5, the two file formats share many concepts in their structural design. Our goal is to set up ORC and Parquet with similar configurations, so as to be able to meaningfully compare their performance. We define three test configurations, reported in Table 7, and we focus on three parameters: the row group size, the page size, and the compression algorithm. All other parameters are set to their default values in all three test configurations. As expected, the use of indexes in ORC and of dictionary pages in Parquet is enabled by default. We set the HDFS block size to 256 MB for all of our tests.
The first test configuration is called Default Config and uses the default ORC and Parquet parameters as stated in each file format documentation.
The two formats have very different default configurations, especially for the compression parameter: ORC uses ZLIB compression whereas Parquet does not use any compression, which makes the benchmark results for the two formats not directly comparable. Nevertheless, we decided to keep this setup to show how the file formats behave when used ''out-of-the-box'' and to highlight the performance change after an optimized parameter configuration.
The two other configurations, named Snappy Config and No Compression Config, use the same values for the row group and page sizes but differ in the compression algorithm. We increase the row group size to 256 MB to obtain more sequential reads from disk at the expense of higher memory usage. 3,44 The page size is set to 1 MB (the default for Parquet). A larger page size improves the compression performance and decreases overhead, again at the expense of higher memory usage.
The compression parameter is also very important, as it determines which general-purpose algorithm, such as Snappy, ZLIB, or LZO, is applied after the file format encoding. 3 With compression, file readers perform fewer I/O operations to get the data from disk, but more CPU cycles are spent to decompress it. In our configurations, we use Snappy compression, which is supported by both ORC and Parquet.
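To make the configurations concrete, the following sketch shows how the Snappy Config from Table 7 could be expressed when creating tables through Hive. The property names follow the ORC and Parquet documentation; the table and column names are hypothetical, and the actual BigBench table definitions differ:

CREATE TABLE item_orc (i_item_sk BIGINT, i_category STRING)
STORED AS ORC
TBLPROPERTIES (
  'orc.stripe.size'='268435456',  -- row group (stripe) size: 256 MB
  'orc.compress.size'='1048576',  -- compression chunk (''page'') size: 1 MB
  'orc.compress'='SNAPPY');       -- generic compression codec

SET parquet.block.size=268435456; -- row group size: 256 MB
SET parquet.page.size=1048576;    -- page size: 1 MB
SET parquet.compression=SNAPPY;   -- generic compression codec
CREATE TABLE item_parquet (i_item_sk BIGINT, i_category STRING)
STORED AS PARQUET;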
Many parameter combinations are possible and would lead to interesting research questions. For example, how does the performance of the two file formats change if we vary the row group and page sizes from 64 MB to 256 MB in steps of 16 MB? Due to time constraints, we limit our tests to these three configurations to achieve a meaningful comparison of the two file formats when running queries on a fixed engine.

Engine configuration
To take full advantage of the columnar file format design, it is important to configure some settings in the query execution engine. Note that these parameters do not affect the file generation, but only the query processing.

Code 1 SparkSQL query execution settings.
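A plausible reconstruction of these settings, based on the description below (lines 1, 3, and 4 are documented Spark SQL properties; line 2 is an assumption, passing a documented Parquet reader property through Spark's Hadoop configuration):

1 spark.sql.parquet.filterPushdown true
2 spark.hadoop.parquet.filter.statistics.enabled true
3 spark.sql.hive.convertMetastoreParquet true
4 spark.sql.orc.filterPushdown true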
For SparkSQL, we edit the spark-defaults.conf file by adding the lines shown above. 60,61 The first group of parameters is relevant for the Parquet file format. Lines 1 and 2 enable full support for predicate push-down optimizations. The parameter at line 3 enables the usage of Parquet's built-in reader and writer for Hive tables, instead of the SerDe. 62 Finally, line 4 enables predicate push-down also for ORC.

Code 2 Hive query execution settings.
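A plausible reconstruction of these settings, based on the description below (all four are documented Hive properties, although the exact set used here is an assumption):

1 set hive.optimize.ppd=true;
2 set hive.optimize.ppd.storage=true;
3 set hive.ppd.recognizetransivity=true;
4 set hive.optimize.index.filter=true;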
BigBench uses custom settings for Hive, overriding the defaults. 63 We set the parameters shown above in the engineSettings.sql file placed in the BigBench Hive subfolder. Lines 1-3 enable the predicate push-down optimization. Line 4 enables the use of format-specific indexes for both ORC and Parquet.

Load times and data sizes
The last step before starting the experiments is to generate data using the BigBench data generator, which is based on the PDGF. 64 It generates data in text format, which then needs to be loaded and converted to a specific file format. The loading time can vary greatly depending on the compression used by ORC and Parquet. BigBench relies only on Hive for creating the schema in the Metastore and storing the file format data into HDFS. Figure 3 shows the load times and resulting data sizes for each configuration. To better compare the measurements in Figure 3, we divide it into four equal quadrants by drawing a line at 350 GB on the Y-axis and at 150 minutes on the X-axis. There is an obvious trade-off between the time taken to generate the data and the size of the generated data. Points in the lower-left quadrant would represent optimal performance; in our case, this quadrant is empty. The ORC Default configuration (using ZLIB compression), in the lower-right quadrant, achieves the best combination of time and data size, followed by the ORC Snappy configuration with the fastest generation time. Parquet achieves its best performance with the Snappy configuration, also in the lower-right quadrant.
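As an illustration, the conversion step can be sketched in HiveQL as a create-table-as-select from the text-format staging table into a columnar table. The table names are hypothetical, and the actual BigBench loading scripts differ in detail:

-- Rewrite the text-format staging table as an ORC table with Snappy.
CREATE TABLE store_sales_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY')
AS SELECT * FROM store_sales_text;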

Performance evaluation
Our plan is to run experiments with all 30 BigBench queries at scale factor 1000 (1000 GB of data). All tests are repeated three times, and we measure the execution time of each query. The averaged values are taken as representative numbers. In order to compare the three different configurations defined in Table 7, we repeat the three runs for each configuration for both ORC and Parquet. To better compare the file formats and understand their different support on multiple processing engines, we perform the experiments on two processing engines, Hive and SparkSQL. As stated before, the tests are repeated three times for each combination of file format type, configuration, and processing engine. By summing and multiplying, we obtain the Total Execution Time (in hours) needed for each of our configurations, which is reported on the last line of Table 8. By summing all total execution times for the experiments, we get around 701 hours (29.2 days) of testing time.
We defined two metrics to help us compare the different file format configurations: the Performance Improvement (PI%) and the Compression Improvement (CI%). When comparing two execution times, we take the higher time HT (worse) as the baseline and express the difference to the lower time LT (better) as a percentage. This is the amount of time saved between the two executions, and we define it as the Performance Improvement (PI%):

PI\% = \frac{HT - LT}{HT} \times 100

Similarly, we define the Compression Improvement (CI%) as the relative time saved by a configuration with compression over the corresponding configuration without compression:

CI\% = \frac{T_{NoComp} - T_{Comp}}{T_{NoComp}} \times 100

For example, if a set of queries takes 100 minutes on one file format and 80 minutes on the other, the PI% is (100 − 80)/100 × 100 = 20%. The Performance Improvement (PI%) metric reports by how many percent a file format configuration is faster on ORC compared to Parquet, or vice versa. It compares the execution times of the different file formats, whereas the Compression Improvement (CI%) compares the improvement for a particular file format (Parquet or ORC) when using data compression. The CI% can be negative when the baseline configuration without any compression is faster than the configurations with compression (eg, Snappy and ZLIB).

HIVE
In the next subsections, we show results for the Hive processing engine. Each table is dedicated to a query set identified by the query type, as described in Table 6. Note that we merged the single MapReduce/UDTF query into the MapReduce/Python group. The first column reports the query number, whereas the following columns report execution times for each combination of configuration (Table 7) and file format type. The Total Time of query execution is shown at the end of the table, and the very last two rows report the PI% and CI%. The latter shows the overall improvement achieved by using data compression, whereas the PI% shows the overall performance difference between the two file formats for each configuration.
Query execution time is reported in minutes, and green cells highlight the best time between the file formats within each of our configurations (ie, Default, No compression, Snappy). Pairs of cells with the same color (white or light gray) are used to show similar performance between the two file formats: we consider query performance comparable/equal if the difference is lower than or equal to 1 minute. The standard deviation across the three query executions is under 5% for all Hive queries, which indicates that Hive achieves very stable execution behavior.
Furthermore, we discuss the results from the perspective of a practitioner who runs BigBench as a black-box benchmark and analyzes the results. This is also the motivation to include the Default file format configurations in our tests. We want to show what kind of performance one gets by using the file formats out-of-the-box and how adjusting the configuration parameters can influence the system behavior.

Pure HiveQL
Table 9 shows results for the Pure HiveQL query type. Figure 4 depicts the query execution times as a visual chart. These queries are fully implemented in HiveQL and, unlike the other query types, do not use UDFs (User Defined Functions) or external libraries.
Looking at the Default configuration column, we can observe a clear pattern with ORC generally performing better. Only Q15, Q16, and Q22 show similar performance. For a naive user, this can be considered expected behavior since, as shown in Table 1, ORC is the default file format in Hive and we expect better optimization with this engine. However, the file formats are configured very differently in this setup, with ORC using ZLIB compression and Parquet using no compression at all.

MapReduce/Python
Table 10 shows results for the MapReduce/Python query type. Figure 5 depicts the query execution times as a visual chart. The queries are implemented in HiveQL and enriched with Python programs to execute complex operations within the query definition.
Starting from the Default column, we can observe again that ORC performs better than Parquet for all the queries. Only Q29 shows a difference of two minutes, but similar performance in the other configurations. We believe that this small difference is only due to noise in the measurements. The performance behavior is confirmed in the No compression and Snappy columns, where the PI is reduced to ∼6.8% and ∼5.1%, respectively.
Both file formats benefit from a small performance improvement thanks to the introduction of compression. This is highlighted in the blue and orange cells at the end of Table 10.

HiveQL/OpenNLP
Table 11 shows results for the HiveQL/OpenNLP query type. Figure 6 depicts the query execution times as a visual chart. The queries are implemented in HiveQL and enriched with a Java UDF to process natural language data (ie, product reviews).

HiveQL/Spark MLlib
Table 12 shows results for the HiveQL/Spark MLlib query type. Figure 7 depicts the query execution times as a visual chart. HiveQL code is used to extract and prepare the input data for the machine learning algorithms. Usually, the input is stored in a temporary table, processed with the Spark MLlib Java library, and the output is stored in a new additional table.
Looking at the Default column, we can see that all the queries except Q28 perform better on ORC than on Parquet. Q28 performance is not affected by any change in the file format configuration. The PI between the two file formats is 8.1%.
Moving to the No compression column, the previous performance pattern is confirmed, but the PI decreases to ∼4.9%. The introduction of Snappy data compression improves the performance for both file formats by a similar amount, ∼2.6% for ORC (blue cells) and ∼2.5% for Parquet (orange cells). Indeed, ORC remains the winning file format also in the Snappy configuration.

Summary
Next, we summarize the major findings of all experiments reported in this section.
1. Using the ''out-of-the-box'' Default file format configuration is not always the optimal choice and depends heavily on the data structure and query type (Section 4.3).
2. ORC generally achieves the best performance with the Hive engine (Section 4).
3. In most cases, using Snappy compression improves the performance of both file formats with Hive, except for the OpenNLP query type, where we observe a negative influence (Section 4.3).

SPARK SQL
In the next sections, we show results for the Spark SQL processing engine, again divided into the four BigBench query types listed in Table 6. The first column reports the query number, whereas the following columns report execution times for each combination of configuration (Table 7) and file format type. The Total Sum of query execution times is shown at the end of the table, and the very last two rows report the PI% and CI%.
Query execution time is reported in seconds, and green cells highlight the best time between file formats within each of our configurations.
Pairs of cells with the same color (white) are used to show similar performance between the two file formats. We consider query performance comparable/equal if the difference is lower than or equal to 5% of the higher execution time. The standard deviation across the three query executions varies greatly among the SparkSQL queries. Queries Q02 and Q30 achieve standard deviations varying between 7% and 16%, which will be explained in Section 5.2. All other queries have standard deviations under 10%, which indicates that SparkSQL is less stable than Hive, as reported in the works of Ivanov and Beer. 24,25 We believe this is also due to execution noise in the cluster, which affects SparkSQL times more because they are generally much shorter than the Hive ones.
Similar to the Hive engine, we first discuss the performance of the Default configuration. Next, we observe the No compression and Snappy configurations to gain insights and better understand the performance behavior. Generally, query performance on Spark showed more variable results. This is also due to the shorter execution times with respect to Hive and, therefore, the bigger influence of noise. The query source code for the Spark experiments is exactly the same as the one used in Hive, and therefore, we use the same query groupings in this section.

Pure HiveQL
Table 13 reports results for the Pure HiveQL query type. Figure 8 depicts the query execution times as a visual chart. In general, Parquet seems to perform better in all three configurations.
Looking at the Default configuration, Parquet performs better than ORC for many of the queries, even though the former does not use compression while the latter uses ZLIB. Some queries perform similarly on both file formats (ie, Q07, Q11, and Q21), whereas others achieve considerably lower execution times with ORC (ie, Q16, Q22, and Q24). In particular, Q22 shows a huge improvement of ∼79.3% when run on ORC with the Default configuration compared to the No compression and Snappy configurations. Further experiments with ZLIB compression would need to be performed to understand this behavior, but this is out of the scope of our work.
Moving to the No compression column, it becomes clearer that, when the two file formats are under the same conditions, Parquet is the optimal choice. At the same time, a group of six queries shows similar performance between the two formats (ie, Q14, Q15, Q21, Q22, Q23, and Q24).
The PI% is significant, with Parquet taking ∼21.5% less time than ORC to execute the full set of queries.
In the Snappy column, we can observe that both file formats get a performance benefit from the application of data compression. Parquet gets an improvement of ∼5.5% (orange cells), whereas ORC gets a more modest improvement of ∼1.5%. Parquet clearly demonstrates its performance advantages over ORC in all queries, with the exceptions of Q07 and Q16, which perform similarly. The PI% between the two formats is even more significant than in the No compression case, reaching ∼24.5%.

MapReduce/Python
Table 14 reports results for the MapReduce/Python query type. Figure 9 depicts the query execution times as a visual chart. First, we need to say that Q02 and Q30 are highly unstable on Spark when using the Parquet file format. The two queries randomly fail with ''Out of Memory'' errors, but in many cases, they complete successfully. To obtain the results in Table 14, we ran Q02 and Q30 more than three times with Parquet until we were able to get three successful run completions. We report the results for completeness, but we do not consider them trustworthy for our discussion. We did not observe this behavior with any of the ORC configurations.

HiveQL/OpenNLP
Table 15 reports results for the HiveQL/OpenNLP query type. Figure 10 depicts the query execution times as a visual chart.
In the Default column, execution time is shorter on Parquet because it uses uncompressed data by default. Q19 shows no change in performance, and this behavior is confirmed in the other columns of the table as well.

HiveQL/Spark MLlib
Table 16 reports results for the HiveQL/Spark MLlib query type. Figure 11 depicts the query execution times as a visual chart.
Looking at the Default configuration, ORC achieves a better Total Time, with a PI% of ∼36.9%. Q20 and Q25 perform better on Parquet, whereas Q28 does not show performance differences in any of the configurations.
Moving to the No compression column, ORC is still the best option in terms of total time, but the PI% between the two formats is similar.
Q05 shows the most unusual behavior: it takes 618 seconds on ORC Default (using ZLIB compression) compared to 1842 seconds on ORC No compression. All other queries do not seem to be affected by the configuration changes.
With Snappy data compression, the two formats show similar performance in all queries, with the exception of Q20 and Q25, which perform better on Parquet. While Parquet gets a significant improvement thanks to data compression (∼7.0%, orange cells), ORC gets a minimal improvement in the Total Time. Only queries Q05 and Q25 perform better when using ORC with Snappy instead of No Compression, whereas all the remaining queries (Q20, Q26, Q28) result in longer execution times.

Summary
Next, we summarize the major findings of all experiments reported in this section.
1. Using the ''out-of-the-box'' Default file format configuration is not always the optimal choice and depends heavily on the data structure and query type (Section 5.3).

2. Parquet generally achieves the best performance with the Spark engine, except for the HiveQL/Spark MLlib (Section 5.4) and MapReduce/Python (Section 5.2) query types, where the behavior is unclear.
3. In most cases, using Snappy compression improves the performance of both file formats with Spark, except for the OpenNLP query type, where we observe a negative influence (Section 5.3). In particular, the Compression Improvement (CI%) for Parquet is negative, with a value of −25.7%.

IN-DEPTH QUERY ANALYSIS
In this section, we perform a deeper query investigation with the goal of identifying the causes of the varying performance behavior reported in the previous sections. Due to time limitations, we select one representative query from each of the four query types in BigBench. These are Q08 (MapReduce/Python), Q10 (HiveQL/OpenNLP), Q12 (Pure HiveQL), and Q25 (HiveQL/Spark MLlib).
We re-executed the queries on SparkSQL configured with No compression and Snappy for ORC and Parquet. Each of the selected queries shows different behavior with respect to the other queries in the same group. We analyze them with the help of the Spark History Server metrics, the query source code, and the resource utilization data. Figure 12 shows all the information we collected about a query (ie, Q08 with No Compression on ORC) by looking at the Spark History Server and the source code. In particular, we collect the Spark tasks, stages, and operations, as well as the input and output data sizes. A Spark query can be split into a number of jobs. A Spark job is a set of tasks resulting from a Spark operation (action).
A task is an individual unit of work that is executed on one executor/container. A Spark stage is a set of tasks in a job that can be executed in parallel. In addition, we collected performance metrics using the Performance Analysis Tool (PAT) 67 by Intel. The tool collects data from all cluster nodes, introducing minimal overhead on the benchmark workload. At the end, it aggregates the data and immediately produces Excel files with visual charts, which are used in our analysis. All generated charts are available on GitHub 65 under the file selected-queries-resource.pdf. In the next subsections, we discuss our findings for each of the selected queries.
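A quick way to observe the scan-operator difference discussed in the summary of this section (a sketch; the table and column names are hypothetical) is Spark SQL's EXPLAIN command, which prints the physical plan of a query:

-- For a Parquet-backed table, the plan contains a ''FileScan parquet''
-- operator; for an ORC table read through the Hive catalog, it shows
-- a HiveTableScan instead.
EXPLAIN SELECT ss_item_sk, SUM(ss_net_paid)
FROM store_sales
GROUP BY ss_item_sk;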

BigBench Q08 (MapReduce/Python)
Q08 investigates the effect of review reading by customers on the sales revenue in a specific time period and for a specific product category. The query can be divided into two distinct phases. In the first phase, three temporary tables are created and stored in plain text format. The first table contains rows representing the dates of the desired time period. The second stores the web sessions of users who read product reviews; a Python program is used to build it. The third and last table stores the sales in the desired time period. In the second phase, the temporary tables are combined to obtain the final result. To better understand the file format impact, we focus primarily on the first phase, where the data is retrieved from the file format structure. A sketch of this temporary-table pattern follows.
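The sketch below illustrates the pattern with a hypothetical date range and simplified logic; it is not the actual BigBench Q08 code. Phase one materializes intermediate results as plain-text temporary tables, so only the reads from the columnar base tables involve the file format under test:

-- Phase 1 (simplified): extract the relevant date range into a
-- plain-text temporary table; the scan of date_dim goes through
-- the columnar file format.
CREATE TEMPORARY TABLE q08_dates AS
SELECT d_date_sk
FROM date_dim
WHERE d_date BETWEEN '2001-09-02' AND '2002-09-02';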

Comparing ORC and Parquet
As shown in Table 14, Q08 performs better on Parquet in both the No Compression and Snappy configurations. Likewise, in both configurations, Parquet reads less input data than ORC in the key execution stages. The use of Snappy compression brings some improvement only for Parquet. Figure 14 shows a huge drop in I/O requests (blue peaks in the yellow square). This is the expected behavior when using compression: the number of disk accesses is reduced while retrieving the same data.

BigBench Q10 (HiveQL/OpenNLP)
Q10 performs sentiment analysis on the product reviews by classifying them as positive or negative. The output also reports the words that lead to the classification. The query does not create any temporary table, but it works directly on data stored in columnar file formats. The sentiment extraction is realized with a Java UDF embedded in the query code.

Comparing ORC and Parquet
This query type is particularly interesting because it generally performs better with the No compression configuration, as shown in Table 15. This is unexpected, as data compression usually benefits performance by reducing the I/O accesses. In the specific case of Q10, we get better performance on Parquet when using the uncompressed format. On the other hand, the query performs better with Snappy compression on ORC, but the performance improvement is less relevant. As far as Parquet is concerned, we hypothesize that Q10 performs better with the No compression configuration because unstructured data is harder to compress efficiently. However, this behavior is not confirmed with ORC. Because of this discrepancy, we look into the resource utilization. Figure 15 shows the disk requests for the No compression configuration. We can immediately notice that the disk is underutilized with both file formats. The same happens with the Snappy configurations and also for the disk bandwidth utilization and CPU utilization, which we do not report in this document for the sake of space.

No Compression Configuration
In this case, the motivation for the query performance behavior is unclear. If a benchmark user is specifically interested in this query type, they should further investigate Q18, Q19, and Q27. If no relevant insight is found, the UDF code should be checked for implementation issues.

ORC
The Snappy configuration takes 37 min. with a total input of 5 GB, compared to 34 min. and a total input of 3.4 GB with the No Compression configuration. Here, using compression results in both a larger total input and worse performance than No Compression.

Parquet
The Snappy configuration takes 51 min. with 4.7 GB of total input, compared to 29 min. and 5 GB of total input with the No Compression configuration. Both configurations have 4 Spark stages, but stage 1 with Snappy does not perform the same operations, as already mentioned above. Despite the smaller input data, using compression results in worse performance than No Compression.

BigBench Q12 (Pure HiveQL)
Q12 searches for customers who viewed products online and then bought a product from the same category in a physical store in the following three months. No temporary tables are created to fulfill the query, meaning that all the relevant data is retrieved directly from the columnar file format and the whole query execution is relevant for our analysis.

Parquet
Using Snappy compression, the query takes 1.4 min. and reads 1.5 GB, compared to 1.7 min. and 2.9 GB with the No Compression configuration. Overall, using compression with Parquet slightly improves the performance. As the reader may notice, the measurements in the run with PAT show a different result with respect to Table 13.
To understand the aforementioned discrepancy, we look at the PAT utilization charts. The disk request and bandwidth utilization profiles are very similar between the No compression and Snappy configurations; we do not report them here since they provide no useful insight. Figure 16 shows the CPU utilization for Q12 on the Parquet file format for both configurations. We observe a generally higher utilization with the Snappy configuration. This is expected, since data retrieved from disk must be decompressed, consuming additional CPU time. As observed in all the other queries in Table 13, despite the need for extra computation, data compression is beneficial for the performance since it reduces the number of disk accesses. The unexpected behavior of Q12 can be explained by the fact that the query reads only a small amount of data from disk, so the benefit of using compression is not appreciable. We believe that the discrepancy between Table 13 and the PAT execution is due to noise, and the difference between the No Compression and Snappy configurations should be considered negligible.

BigBench Q25 (HiveQL/Spark MLlib)
Q25 groups customers based on a set of shopping dimensions. To achieve this, it uses a k-means clustering algorithm 68 implemented with Spark MLlib. The query is split into two phases: the first phase creates a temporary table to prepare the data for the clustering algorithm, whereas the second phase simply executes a Spark job and stores the results. As usual, the temporary table is stored in plain text, making only the first phase relevant for our performance analysis of the file formats.

Summary
1. The most important lesson learned is that changing the file format influences the overall engine behavior. The functions used to retrieve data are different: with ORC, Spark prefers to use HiveTableScan, whereas, with Parquet, it prefers to use FileScanRDD. This emphasizes the importance of the file format choice when using a specific engine.
2. In many cases, we observed that both formats vary in the number of Spark stages, tasks, and operations, which they execute for the same HiveQL query code.
3. Generally, the introduction of Snappy compression improves the query performance for both file formats by reducing the amount of data read from disk. However, the query performance is not only influenced by the amount of input data from disk: for Q25, Parquet shows better performance than ORC while still reading more data.

4. The HiveQL/OpenNLP Q10 shows abnormal behavior. Cluster resources are underutilized, and Snappy compression worsens the performance. While the latter can be caused by the unstructured data type, the cluster underutilization can be caused by flawed workload code in the benchmark. Further investigation of the Java UDF source code is necessary.

DISCUSSIONS AND FUTURE WORK
In this paper, we evaluated the ORC and Parquet file formats on top of Hive and Spark by varying the format configurations and comparing their performance. However, both Parquet and ORC are developed as independent, general-purpose data file formats and as such are easy to integrate into other engines. In Table 1, at the beginning of this paper, we listed multiple engines that already support both formats. This also means that each engine offers a different level of integration for the two formats as well as different default parameter configurations.
For example, Spark SQL uses Parquet with Snappy compression 69 as the default storage format, contrary to Hive which uses Parquet with no compression as a default configuration.
In addition to that, every engine focuses on implementing optimizations for its primary storage file format, leaving all other supported formats behind. Therefore, when benchmarking, one should explicitly differentiate between ''benchmarking SQL-on-Hadoop engine performance'' and ''benchmarking data file format performance''. Comparing the performance of different engines does not immediately mean that one is also comparing the storage file formats used by the engines.

What are the best practices, methodologies and metrics when comparing different file formats?
Based on our experimental results and the benchmarking methodology that we implemented to compare file format performance in this study (in particular, ORC and Parquet), we assembled a list of steps that can be used as a best-practices guide when testing file formats on a new SQL-on-Hadoop engine:
• Make sure the new engine offers support for the file formats under test (ORC and Parquet).
• Choose a benchmark or test workload representing your use case and suitable for the comparison.
• Set the file format configurations accordingly (with or without compression).
• Generate the data and make sure that it is consistent with the file format configuration, for example by inspecting the tables with the file format (ORC and Parquet) tools (see the sketch after this list).
• Perform the experiments (at least 3 times) and calculate the average execution times.
• Compare the time differences between the two file formats using the PI% and CI% metrics and draw conclusions.
• Select queries compatible with your specific use case. Execute them while collecting resource utilization data and perform an in-depth query evaluation to spot bottlenecks and problems.
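As one possible way to carry out the verification step above (a sketch; the table name is hypothetical), Hive itself can report how a table is actually stored:

-- DESCRIBE FORMATTED reports the input/output format classes and the
-- table properties (eg, orc.compress), so a misconfigured table can be
-- spotted before running the benchmark.
DESCRIBE FORMATTED store_sales_orc;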
One important evaluation factor mentioned above is the benchmark used for stress testing the engines. As reviewed in the related work (Section 2.6), TPC-H and TPC-DS are the most common choices for benchmarking both engine and file format performance. In this paper, we utilize the BigBench benchmark, as it is currently the only standard benchmark including structured, unstructured, and semi-structured data types as well as machine learning and text processing workloads.

Is BigBench suitable for file format comparisons?
Based on our experimental results (Sections 4, 5, and 6), we can conclude that BigBench is a good choice for comparing file formats on SQL-on-Hadoop engines, mainly for two reasons: (1) structured and unstructured data influence the query performance, particularly in combination with compression (Q10 in Section 6.2); and (2) BigBench offers a variety of 30 different workloads (use cases) divided into four categories based on implementation type.
However, many other questions around benchmarking file formats remain open:
• Is there a need for a specialized micro-benchmark to better investigate the file format features?
• If yes, what should this benchmark include in terms of data types and operations?
• What are the file format features that such a benchmark should stress (for example block size, compression etc.)?
As future work, we plan to investigate these questions. For example, further insights can be obtained by running benchmarks on the same processing engine and file format while changing the file format parameters. This would help to better distinguish the influence of each architectural component on the query performance. New metrics can be added to the in-depth query analysis, such as the amount of network traffic exchanged by each node in the cluster on relevant TCP and UDP ports, as shown in previous work, 70 to spot bottlenecks and unbalanced workloads.
Resource utilization data collection and the comparison of graphs and execution plans can be standardized and integrated into BigBench for better usability. Data analysis and visualization can then be included in a graphical user interface, like Apache Hue 71 which is highly customizable.
We also plan to evaluate columnar file formats with other benchmarks and dataset types, leading to their adoption in other applications. For example, ORC and Parquet together with Hive or SparkSQL can be used to store and query time-series data, like data coming from sensors.
Software architectures for sensor data analysis can select these technologies for their batch layer to achieve better access performance with respect to simple plain-text files on HDFS, or to replace complex NoSQL systems. 72,73 However, dedicated benchmarks are needed to ensure the suitability of columnar file formats for range-based queries, which are commonly used to access time-series data.

CONCLUSIONS
To the best of our knowledge, this is the first study that evaluates the ORC and Parquet file formats on Hive and SparkSQL using the BigBench benchmark, a popular representative of Big Data workloads. From our benchmark results, we can observe that it is important to separate the file format evaluation from the engine evaluation. Both components have a great influence on the workload, and a comparison of their combination is meaningless because we cannot tell which component is really causing a change in performance. In Section 6, we showed how a different file format selection changes the Spark execution behavior. In particular, different functions are used to retrieve data from ORC and Parquet files. We believe that our benchmark methodology, in which we keep the processing engine fixed while changing the file format, is correct.
At the same time, overall performance on the same engine is greatly influenced by the file format parameters. The default configurations of ORC and Parquet are extremely different and their direct performance comparison can lead to misleading results. Therefore, the file format selection cannot be naively based on the engine preference stated by documentation or by quick-and-dirty benchmarks. A careful understanding of file format parameters, especially the use of data compression, is necessary to make the optimal choice.
Once this is set up, a careful understanding of the benchmark workload is necessary. We showed that ORC generally achieves the best performance with the Hive engine (Section 4), whereas Parquet generally achieves the best performance with the Spark engine (Section 5). However, this is not true for all BigBench query types, with significant exceptions, as shown in Sections 4.3, 5.4, and 5.2. Users should identify and select the groups of queries in BigBench that best emulate their use case and then evaluate performance. This will not only reduce the time needed to execute the benchmark but will also give more valuable results. In this respect, BigBench is extremely flexible: in this work, we ran the whole set of queries, but a user can select and run only a subset of them.
Similarly, the use of data compression is not advised for all query types. In most cases, using Snappy compression improves the performance of both file formats on both engines, except for the OpenNLP query type, where we observe a negative influence on both engines (Sections 4.3 and 5.3).
In the future, we plan to perform more experiments with different configurations by modifying other file format parameters and investigating their influence. Furthermore, we plan to understand the cause of the instability of queries Q02 and Q30 on Spark (Section 5.2) as well as the resource underutilization found with Q10 (Section 6.2).