The future of OLAP table storage is not Iceberg
Introduction
In the last few years, table catalogs based on the Apache Iceberg, Apache Hudi, or Delta Lake table formats (a.k.a. table catalog specifications), which embody the lakehouse architecture, have been widely adopted by data teams and celebrated as a success of the open data management ecosystem.
All these table formats rely entirely on object storage services (like Amazon S3) or distributed file systems (like HDFS) for storing both the data partition files (usually in Parquet format) and the metadata about the tables and the data partitions. This relieves table catalogs of worrying about data and metadata durability, read availability, and write availability.
However, as I detail below in this post, this way of modularising table catalogs is fundamentally limited in performance and cost efficiency relative to the architecture in which table storage is integrated with query (pre-)processing and data ingestion functionality, such as in BigQuery, ClickHouse, Apache Doris, Apache Druid, Firebolt, Apache Pinot, Redshift, StarRocks, and other data warehouses. These limitations are especially pronounced in high-query-volume or low-query-latency (“online”) OLAP use cases.
These inefficiencies increase the costs (and/or query and data ingestion latency) for all data teams that use table catalogs based on Iceberg, Hudi, or Delta Lake as “single gateways” to their data, either as higher service provider bills or as increased provisioning requirements in self-managed setups. There are also additional drawbacks:
In SaaS setups, the data function is exposed to service failure and service quality risks from more distinct service providers.
In self-managed setups, data teams have to manage three separate systems: the object storage (or distributed FS), the table catalog, and the data processing engine, instead of two: the data warehouse that implements all the table catalog, data ingestion, and query processing functions and, optionally, the object storage (or distributed FS) for cheaper storage of cold data.
I suggest open table transfer protocols functionally equivalent to the BigQuery Storage APIs. The protocols should be implemented by open-source OLAP data warehouses. They would make the data just as accessible from different processing engines as object storage-based table formats do. But unlike adopting these table formats, this would come at no additional cost, system risk, or operational complexity for data teams who already use some OLAP data warehouse for its performance benefits and don’t plan to migrate off of it.
This post is the first part of a four-article series. In this article, I zoom in on the architectural limitations of object storage-based table formats relevant to OLAP use cases.
In the second article, I introduce table transfer protocols and describe the design of the Table Read protocol.
In the third article, I will describe the Table Write protocol and a table replication method.
In the fourth article, I summarise the preceding articles and discuss the table transfer protocols’ trade-offs relative to object storage-based table formats from the data team’s perspective. Disclaimer: I think there are few, and if the table transfer protocols were supported by data warehouses and processing engines as widely as Iceberg is supported today, most teams would choose to “unlock” the data in their OLAP data warehouse via the table transfer protocols rather than Iceberg for purely technical reasons.
Data warehouses designed for OLAP manage their table storage themselves
All OLAP data warehouses with transparent data tiering between different types of disks and/or nodes in the cluster (ClickHouse, Apache Doris, StarRocks, Apache Druid, and Apache Pinot1) have to manage their table storage2 on top of the ordinary file system, memory management, and networking APIs in Linux. This includes taking care of data and metadata durability, read availability, and write availability at the level of the distributed system.
ClickHouse, Apache Doris, and StarRocks can also evict cold data from the primary (node-attached) disk storage to object storage or a distributed FS for storage efficiency3, but it doesn’t change the fact that they need to manage their tables on their own, coherently across both disk and object storage tiers.
Most other data warehouses that have a disaggregated architecture with object storage at the bottom, namely Databend, Firebolt, Amazon Redshift (with Redshift Managed Storage), and Snowflake, implement read- and write-through SSD caching (e.g., distributed ephemeral storage in Snowflake) and de facto also have to maintain a table storage view on top of these SSDs, even if they asynchronously store table metadata in Iceberg. The SSD cache management is pretty much equivalent to “full” table catalog/storage management, with only minor differences in replication and eviction strategies.
The performance and efficiency trade-offs between these architectural approaches in OLAP-ready data warehouses are very nuanced and application-specific:
Transparent storage tiering between node types, disk types, and object storage (or distributed FS), as in ClickHouse, Apache Doris, and StarRocks.
Storing all data and metadata in the object storage and doing SSD caching on top, as in Databend, Firebolt, Amazon Redshift (with RMS), and Snowflake (native storage).
A custom, node-disaggregated table storage, as in BigQuery (native storage).
Synchronisation of metadata between the data warehouse’s custom metadata management system and Iceberg, as in Snowflake (with Iceberg tables managed in Snowflake catalog) and BigQuery (with BigLake-managed Iceberg tables).
Other hybrid approaches that fit none of the above descriptions, such as in Apache Druid and Apache Pinot.
Discussing these trade-offs is beyond the scope of this article. The point that I want to make here is that in all these approaches, data warehouses manage table storage themselves in one way or another rather than relying on austere object storage-based table formats and their implementations as separate services (such as Apache Hive’s Metastore, Tabular, Databricks Unity Catalog, AWS Glue Data Catalog, Dremio, etc.).4 Otherwise, their OLAP performance and efficiency would suffer. I explain why in the next section.
Why object storage-based table formats are inefficient for OLAP use cases
Network IO amplification: inability to filter or pre-aggregate rows on the storage side
When a query includes a simple filter on a table column that is neither a sort column in the Parquet file nor the table partition column, the object storage has to send the entire column of the Parquet file over the network to be filtered on the side of the query processing node. This is wasteful if the filter is highly selective and simple (e.g., an equality or a numeric comparison), so applying it wouldn’t require much of the storage node’s CPU time.
Read amplification also occurs when the query computes total (i.e., non-grouped) aggregates (or aggregates grouped by a low-cardinality column) such as sum, min, max, sum + count (to compute the average in the end), or data sketches for approximate quantiles, distinct counts, and more. The object storage nodes cannot compute pre-aggregation results per data partition file and then send these results to the query processing nodes for the final aggregation across partitions. A data sketch structure, let alone a single numerical aggregation result, is order(s) of magnitude smaller than the original column that would otherwise be sent over the network.
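To make this concrete, here is a minimal sketch (using PyArrow against a hypothetical bucket layout; all names are illustrative) of how a query engine reading an object storage-backed table evaluates a filter and an aggregate only after the column chunks have already crossed the network:

```python
import pyarrow.compute as pc
import pyarrow.dataset as ds
from pyarrow import fs

# Hypothetical bucket and table layout, for illustration only.
s3 = fs.S3FileSystem(region="us-east-1")
dataset = ds.dataset("my-bucket/events/", format="parquet", filesystem=s3)

# The filter is evaluated on the query processing node. Parquet statistics
# may let whole row groups be skipped, but every surviving "user_id" and
# "revenue" column chunk still travels over the network before filtering.
table = dataset.to_table(columns=["revenue"], filter=pc.field("user_id") == 42)

# The aggregate is likewise computed after the transfer; a storage-side
# implementation could instead return a single per-file partial sum.
total_revenue = pc.sum(table["revenue"]).as_py()
```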
Request amplification when accessing many data partition files on the same storage nodes
Often, there are many more data partition files to be processed in a query than there are query processing nodes (and storage nodes, with self-managed object storage or a distributed FS). To read each partition file, a separate series of network requests has to be made to the object storage (one request to read the Parquet file footer, then a separate request to read every needed column within each row group) because the query processing nodes don’t know how the file’s blocks are physically placed on the object storage nodes.
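As a sketch (assuming PyArrow and a hypothetical bucket and key; the point is the request pattern, not the data), the number of object storage requests per file can be estimated from the Parquet metadata:

```python
import pyarrow.parquet as pq
from pyarrow import fs

# Hypothetical bucket and key, for illustration only.
s3 = fs.S3FileSystem(region="us-east-1")
parquet_file = pq.ParquetFile(
    s3.open_input_file("my-bucket/events/part-0001.parquet"))

needed_columns = 3  # columns actually referenced by the query
# One ranged GET for the footer, plus (roughly) one ranged GET per
# (row group, needed column) pair, unless adjacent reads get coalesced:
requests_per_file = 1 + parquet_file.metadata.num_row_groups * needed_columns
```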
This request amplification costs money on the most popular object storage services (S3, GCS, and Azure Blob Storage): they all charge for every request made to them.
However, even if the object storage service doesn’t charge extra for each request (e.g., Wasabi), query processing nodes still bear the overhead of opening and managing one to two orders of magnitude more network connections, data buffers, and query execution sub-streams than would be needed with data warehouse-managed table storage on disks or self-managed object storage, where the data for a given query and execution topology could be transmitted over just one network connection per pair of storage and query processing nodes that need to exchange data.
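For a rough sense of scale, here is a back-of-envelope calculation with illustrative numbers (actual request prices and file layouts vary; check your provider’s current pricing):

```python
# Illustrative numbers only; not a benchmark.
files_scanned = 10_000
requests_per_file = 1 + 4 * 3      # footer + 4 row groups x 3 needed columns
get_price_per_1000 = 0.0004        # roughly the S3 Standard GET price, USD

requests_per_query = files_scanned * requests_per_file           # 130,000
cost_per_query = requests_per_query / 1000 * get_price_per_1000  # ~$0.05
cost_per_day_at_1_qps = cost_per_query * 86_400                  # ~$4,500
```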
The trade-off between data ingestion latency and the “small file problem” with its storage amplification
In object storage-based table formats, the copy-on-write approach to updating metadata and the copy-on-write or merge-on-read approaches to updating data partitions during ingestion lead to high write and storage amplification, due to the object storage overhead of managing a lot of small files (the so-called “small file problem”), if the data engineer wants to commit data to the object storage with high frequency, e.g., once every few minutes, to make the ingested data available for querying with that latency.
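A quick back-of-envelope illustration with made-up but plausible numbers shows how fast the object count grows at such commit frequencies:

```python
# Illustrative numbers only.
commit_interval_minutes = 5       # commit every 5 minutes
files_written_per_commit = 50     # data partition files written per commit

commits_per_day = 24 * 60 // commit_interval_minutes             # 288
data_files_per_day = commits_per_day * files_written_per_commit  # 14,400
# Each commit also adds new metadata (manifest/snapshot) files, so the total
# object count grows even faster until compaction and vacuuming catch up.
```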
Table catalogs mitigate this storage amplification with background compaction and “vacuum” procedures. These procedures are themselves relatively CPU-intensive and lead to even more write amplification. The latter is not a concern with almost all object storage services, which don’t charge anything extra for written bytes and file deletes, but it could be somewhat of a concern in setups with self-managed object storage or a distributed FS, because the burden of the heightened disk write volume and file churn falls on the data team.
However, even if the data team is ready to pay this write amplification price for a data ingestion latency of a few (tens of) minutes, instant insertion latency is completely unachievable unless the table catalog involves stateful ingestion nodes with disks. This cannot fit into the object storage-only table format designs of Iceberg, Delta Lake, and Hudi.
This is why ClickHouse, Druid, Pinot, Doris, StarRocks, Firebolt, and other data warehouses ingest data on nodes with SSD disks (making these inserts queryable instantly) and manage batching, indexing, and compaction of the data partitions into columnar format (also stored on disks, or in object storage) asynchronously. This is also why Snowflake has ultimately added hybrid tables. Object storage-based table catalogs are blind to this freshly ingested data by design.
Object storage-based table formats miss the innovations in file formats for data partitions
There is a lot of innovation in the space of file formats for columnar data partitions: see Nimble, Lance, BtrBlocks, and Vortex, not to mention the internal file formats of ClickHouse, Doris, Druid, Pinot, StarRocks, and other data warehouses, which also steadily improve.
Apart from “mundane” improvements in the file layout for columnar data and skip indexing that permits reading as few file blocks as possible, the new file formats innovate on new column data types and encodings, such as for vector embeddings. These file formats also have first-class support for secondary indexes, such as inverted/bitmap, full-text, vector, geo, or star-tree indexes. These indexes deliver the most benefit in use cases with a high volume of (OLAP) queries of a particular kind, such as those generated by interactive analytics, text search, geo search, and AI apps.
Secondary indexes in data partition files also synergise with storage-side filtering and pre-aggregation (as discussed in the section “Network IO amplification” above) because secondary indexes reduce the CPU and memory requirements of these pre-computations and thus permit storage nodes (which can be relatively CPU- and memory-poor) to serve more queries with such pre-computations in parallel.
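As a toy illustration (plain Python structures standing in for the far more compact on-disk index formats), an inverted index lets a storage node compute a partial aggregate by touching only the matching rows instead of scanning the whole column:

```python
from collections import defaultdict

# Stand-ins for the contents of one data partition file (illustrative only).
country = ["DE", "US", "DE", "FR", "US", "DE"]
revenue = [10.0, 7.5, 3.2, 9.9, 4.1, 8.8]

# Inverted index built once at write time: column value -> row positions.
index = defaultdict(list)
for row_id, value in enumerate(country):
    index[value].append(row_id)

# Storage-side partial aggregate for: SELECT sum(revenue) WHERE country = 'DE'.
# Only the indexed row positions are visited, which keeps the CPU cost low.
partial_sum = sum(revenue[i] for i in index["DE"])  # 22.0
```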
Note that the argument in this section is not purely technical, nor is it about using object storage per se. Data warehouses can benefit from innovative data partition file formats even for the cold data that is stored in object storage rather than on the disks attached to the data warehouse’s nodes.
Rather, this argument is about the fact that object storage-based table formats are forced to make data partition file formats part of their specifications.
Although the “big three” table formats allow some flexibility in data partition file formats, that is, choosing between Apache Avro, ORC, and Parquet, they couldn’t easily add new formats to this list because the ability to read these new formats would have to be supported by every processing engine separately, rather than just by the table catalog implementations. Many processing engines would lag in adding support for these new file formats. This would undermine the original selling point of the table formats, namely, unlocking the table data for querying from arbitrary processing engines in parallel to the primary data warehouse.5
In practice, even if Iceberg, Hudi, or Delta Lake eventually adds new data partition file formats (such as Nimble), it will take a lot of time and discussion among the numerous stakeholders of these table formats.
I don’t mean to imply that new things should be added hastily to open-source table formats on which a lot of stakeholders depend. Unfortunately, on-disk format specifications necessarily move very slowly.
However, a table transfer network protocol would be a more compact interface that can nevertheless benefit from the innovation in the storage layer, while also enabling this innovation by abstracting away storage layout concerns, as I describe in the second part of this series.
Apache Druid and Apache Pinot delegate to the object storage for the durability of “historical” data partitions but ensure read and write availability, as well as the durability of freshly ingested data on their own.
I use the term “table storage” here rather than “table catalog” because few of these data warehouses explicitly encapsulate the functions of table catalogs (such as Iceberg) in an internal component or abstraction. Rather, the realisation of these functions is spread across the data warehouse internals and is intertwined with the implementation of query execution, resource management, and other things that data warehouses do. This lack of encapsulation and, therefore, the lack of (process) isolation of the table storage abstraction is not an issue in itself, except in very exotic situations such as the data warehouse failing cluster-wide due to a “poison pill” query. If a data warehouse were to implement the open table transfer protocol that I propose in the second part of this series, it would naturally introduce this table storage abstraction, at least at the level of its source code organisation.
Object storage and distributed file systems apply erasure coding to achieve <1.5x storage overhead for durability. In all open-source data warehouses with disk-based storage that I know of, erasure coding is not implemented (and is probably impractical because it also involves query performance trade-offs) and a simple replication scheme is used. This entails 2x or 3x storage overhead.
An example of the latter, “table catalog-first” architecture is Redshift Spectrum (or Redshift Serverless) + AWS Glue Data Catalog for managing Iceberg tables + AWS S3 for storing Iceberg data and metadata. If this approach were performance-optimal and delivered acceptable latency in OLAP use cases, there would be no need for the Redshift Managed Storage offering.
This concern would be somewhat alleviated if many of these query engines converged on using Apache DataFusion as an embedded library for storage access. Then, only this library would need to promptly add support for reading the new data partition file formats.