Table Evolution

Iceberg supports in-place table evolution. You can evolve a table schema just like SQL – even in nested structures – or change partition layout when data volume changes. Iceberg does not require costly distractions, like rewriting table data or migrating to a new table.

For example, Hive table partitioning cannot change so moving from a daily partition layout to an hourly partition layout requires a new table. And because queries are dependent on partitions, queries must be rewritten for the new table. In some cases, even changes as simple as renaming a column are either not supported, or can cause data correctness problems.

Schema evolution

Iceberg supports the following schema evolution changes:

Iceberg schema updates are metadata changes. Data files are not eagerly rewritten.

Note that map keys do not support adding or dropping struct fields that would change equality.

Correctness

Iceberg guarantees that schema evolution changes are independent and free of side-effects:

  1. Added columns never read existing values from another column.
  2. Dropping a column or field does not change the values in any other column.
  3. Updating a column or field does not change values in any other column.
  4. Changing the order of columns or fields in a struct does not change the values associated with a column or field name.

Iceberg uses unique IDs to track each column in a table. When you add a column, it is assigned a new ID so existing data is never used by mistake.

Partition evolution

Iceberg table partitioning can be updated in an existing table because queries do not reference partition values directly.

When you evolve a partition spec, the old data written with an earlier spec remains unchanged. New data is written using the new spec in a new layout. Metadata for each of the partition versions is kept separately. Because of this, when you start writing queries, you get split planning. This is where each partition layout plans files separately using the filter it derives for that specific partition layout. Here’s a visual representation of a contrived example:

Partition evolution diagram The data for 2008 is partitioned by month. Starting from 2009 the table is updated so that the data is instead partitioned by day. Both partitioning layouts are able to coexist in the same table.

Iceberg uses hidden partitioning, so you don’t need to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don’t contain matching data.

Partition evolution is a metadata operation and does not eagerly rewrite files.