Today we’re excited to announce built-in maintenance for Iceberg in Crunchy Data Warehouse, which brings PostgreSQL-style maintenance directly to Iceberg. The warehouse autovacuum workers continuously optimize Iceberg tables by compacting data and cleaning up expired files. In this post, we’ll explore how we handle cleanup; in follow-up posts, we’ll take a deeper dive into compaction.
If you use Postgres, you are probably familiar with tables and rows in a relational database. Iceberg is a table format designed to handle analytical queries across large datasets. Instead of storing data in Postgres’ pages, it organizes the data into Parquet files and typically stores them in object storage like S3, with an organizational layer on top. Parquet is a compressed columnar file format that stores data efficiently.
On Crunchy Data Warehouse, Postgres tables backed by Iceberg behave almost exactly like regular Postgres tables. You can run full SQL queries, perform ACID transactions, and use standard DDL commands like CREATE TABLE or ALTER TABLE. We’re excited to add vacuum processes to Iceberg to create an even better, hassle-free user experience.
Orphan Files in Iceberg
In Postgres, when you update or delete rows, the changes happen inside the same table storage. The database keeps track of visibility using MVCC, and old versions of rows are eventually freed up by vacuum.
Iceberg works differently because its data files are immutable. When you update or delete data, Iceberg doesn’t modify existing files—it creates new ones with the updated data. The table’s metadata is then updated to point to the new files, while the old ones become unreferenced.
Over time, as more updates and deletes happen, these orphaned files, meaning files that are no longer referenced by any active table snapshot, start to accumulate.
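To make the idea concrete, here is a minimal sketch in Python of how orphan files arise. It is a toy model, not how Iceberg or Crunchy Data Warehouse is implemented: the `storage` set stands in for object storage, each snapshot is just the set of file names it references, and the `orphan_files` helper and the file names are hypothetical.

```python
from typing import FrozenSet, Iterable, Set


def orphan_files(storage: Set[str], live_snapshots: Iterable[FrozenSet[str]]) -> Set[str]:
    """Return files present in storage that no live snapshot references."""
    referenced: Set[str] = set().union(*live_snapshots)
    return storage - referenced


# Toy stand-ins (assumptions for illustration only).
storage: Set[str] = set()             # every file ever written
snapshots: list = []                  # one frozenset of file names per snapshot

# Initial insert: one data file is written and referenced by snapshot 1.
storage.add("data-1.parquet")
snapshots.append(frozenset({"data-1.parquet"}))

# Update: the existing file is never modified. A new file is written and
# snapshot 2 points at it instead of the old one.
storage.add("data-2.parquet")
snapshots.append(frozenset({"data-2.parquet"}))

# While snapshot 1 is still retained, the old file is still referenced.
orphans_before = orphan_files(storage, snapshots)

# Once only the latest snapshot is active, the old file is orphaned.
orphans_after = orphan_files(storage, snapshots[1:])

print(orphans_before)
print(orphans_after)
```

In this model the orphan set is simply a set difference: everything in storage minus everything a live snapshot can still reach. The old file only becomes an orphan once no active snapshot references it, which is why such files pile up quietly until something removes them.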
Cleaning up orphan files
Just li