Iceberg Nessie Integration
Iceberg provides integration with Nessie through the iceberg-nessie module.
This section describes how to use Iceberg with Nessie. Nessie provides several key features on top of Iceberg:
- multi-table transactions
- git-like operations (e.g. branches, tags, commits)
- hive-like metastore capabilities
Enabling Nessie Catalog
The iceberg-nessie module is bundled with the Spark and Flink runtimes for all versions from 0.11.0. To get started with Nessie and Iceberg, simply add the Iceberg runtime to your process.
One major feature introduced in release 0.11.0 is the ability to easily interact with a custom catalog from Spark and Flink. See Spark Configuration and Flink Configuration for instructions on adding a custom catalog to Iceberg.
To use the Nessie catalog, set the following properties:
- warehouse. Like most other catalogs, the warehouse property is a file path to where this catalog should store tables.
- uri. This is the Nessie server base URI, e.g. http://localhost:19120/api/v1.
- ref (optional). This is the Nessie branch or tag you want to work in.
To run directly in Java this looks like:
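    import java.util.HashMap;
    import java.util.Map;
    import org.apache.iceberg.CatalogUtil;
    import org.apache.iceberg.catalog.Catalog;

    // hadoopConfig is an existing org.apache.hadoop.conf.Configuration
    Map<String, String> options = new HashMap<>();
    options.put("warehouse", "/path/to/warehouse");
    options.put("ref", "main");
    options.put("uri", "http://localhost:19120/api/v1");
    // loadCatalog takes the catalog properties before the Hadoop configuration
    Catalog nessieCatalog = CatalogUtil.loadCatalog(
        "org.apache.iceberg.nessie.NessieCatalog", "nessie", options, hadoopConfig);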
and in Spark:
    // conf is an existing org.apache.spark.SparkConf
    conf.set("spark.sql.catalog.nessie.warehouse", "/path/to/warehouse");
    conf.set("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1");
    conf.set("spark.sql.catalog.nessie.ref", "main");
    conf.set("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog");
    conf.set("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog");
There is nothing special about the nessie name used above. A Spark catalog can have any name; the important parts are the catalog-impl setting and the configuration required to connect to Nessie correctly.
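To make the point about catalog naming concrete, here is a minimal sketch that derives the Spark property keys from an arbitrary catalog name. The class `NessieSparkProps` and its `build` method are hypothetical helpers written for this page, not part of Iceberg or Nessie; only the property keys and class names are taken from the Spark example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not an Iceberg or Nessie class): builds Spark catalog properties. */
final class NessieSparkProps {

    private NessieSparkProps() {}

    /**
     * @param catalogName any Spark catalog name; "nessie" is not special
     * @param warehouse   required warehouse path
     * @param uri         required Nessie server base URI
     * @param ref         optional branch or tag; pass null to leave it unset
     */
    static Map<String, String> build(String catalogName, String warehouse, String uri, String ref) {
        requireNonEmpty(catalogName, "catalogName");
        requireNonEmpty(warehouse, "warehouse");
        requireNonEmpty(uri, "uri");

        String prefix = "spark.sql.catalog." + catalogName;
        Map<String, String> props = new LinkedHashMap<>();
        props.put(prefix, "org.apache.iceberg.spark.SparkCatalog");
        props.put(prefix + ".catalog-impl", "org.apache.iceberg.nessie.NessieCatalog");
        props.put(prefix + ".warehouse", warehouse);
        props.put(prefix + ".uri", uri);
        if (ref != null) {
            // ref is optional, so the key is only emitted when a value was given
            props.put(prefix + ".ref", ref);
        }
        return props;
    }

    private static void requireNonEmpty(String value, String name) {
        if (value == null || value.isEmpty()) {
            throw new IllegalArgumentException(name + " is required");
        }
    }

    public static void main(String[] args) {
        // Prints one key=value pair per line for a catalog named "my_catalog"
        build("my_catalog", "/path/to/warehouse", "http://localhost:19120/api/v1", "main")
            .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Each entry of the returned map could then be passed to `conf.set(key, value)` in the same way as the hard-coded keys above.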
Once you have a Nessie catalog you have access to your entire Nessie repo. You can then create, delete and merge branches and perform commits on them. Each Iceberg table in a Nessie catalog is identified by an arbitrary-length namespace and a table name (e.g. data.base.name.table). These namespaces are implicit and don’t need to be created separately. Any transaction on a Nessie-enabled Iceberg table is a single commit in Nessie. Nessie commits can encompass an arbitrary number of actions on an arbitrary number of tables; in Iceberg, however, this is limited to the set of single-table transactions currently available.
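The identifier structure described above can be sketched in plain Java. `DottedIdentifier` is a hypothetical stand-in written for illustration; it is not an Iceberg class, and real code would use the identifier types that Iceberg provides. It only shows the split rule: every level but the last forms the namespace, and the last level is the table name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: splits a dotted identifier into namespace levels and a table name. */
final class DottedIdentifier {
    final List<String> namespace;
    final String table;

    private DottedIdentifier(List<String> namespace, String table) {
        this.namespace = namespace;
        this.table = table;
    }

    static DottedIdentifier parse(String identifier) {
        // A limit of -1 keeps trailing empty strings, so "a.b." is rejected below
        String[] parts = identifier.split("\\.", -1);
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("Empty level in identifier: '" + identifier + "'");
            }
        }
        List<String> levels = Arrays.asList(parts);
        int last = levels.size() - 1;
        // Everything before the last level is the (possibly empty) namespace
        List<String> namespace =
            Collections.unmodifiableList(new ArrayList<>(levels.subList(0, last)));
        return new DottedIdentifier(namespace, levels.get(last));
    }

    public static void main(String[] args) {
        DottedIdentifier id = parse("data.base.name.table");
        System.out.println(id.namespace + " / " + id.table); // prints "[data, base, name] / table"
    }
}
```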
Further operations such as merges, viewing the commit log or diffs are performed by direct interaction with the NessieClient in Java or by using the Python client or CLI. See Nessie CLI for more details on the CLI and the Spark Guide for a more complete description.
Nessie and Iceberg
For most cases Nessie acts just like any other catalog for Iceberg: it provides a logical organization of a set of tables and atomicity for transactions. However, using Nessie opens up other interesting possibilities. When using Nessie with Iceberg, every Iceberg transaction becomes a Nessie commit. This history can be listed, merged or cherry-picked across branches.
Loosely coupled transactions
By creating a branch and performing a set of operations on that branch you can approximate a multi-table transaction. A sequence of commits can be performed on the newly created branch and then merged back into the main branch atomically. This gives the appearance of a series of connected changes being exposed to the main branch simultaneously. Although downstream consumers see multiple transactions appear at once, this isn’t a true multi-table transaction on the database. It is effectively a fast-forward merge of multiple commits (in git language), and each operation from the branch is its own distinct transaction and commit. In a real multi-table transaction, by contrast, all changes would be in the same commit. The approach does, however, allow multiple applications to take part in modifying a branch and for this distributed set of transactions to be exposed to downstream users simultaneously.
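The difference between a fast-forward merge and a single multi-table commit can be illustrated with a toy in-memory model. `ToyRepo` below is purely hypothetical and single-threaded; it does not use or imitate the Nessie client API. It only shows that the commits made on a branch remain distinct entries in history, yet become visible on main in one step.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory model of branches and commits; NOT the Nessie API. */
final class ToyRepo {
    private final Map<String, List<String>> branches = new HashMap<>();

    ToyRepo() {
        branches.put("main", new ArrayList<>());
    }

    /** Creates a branch that starts with a copy of the source branch's history. */
    void createBranch(String name, String from) {
        List<String> source = commits(from);
        if (branches.containsKey(name)) {
            throw new IllegalArgumentException("Branch already exists: " + name);
        }
        branches.put(name, new ArrayList<>(source));
    }

    /** Records one single-table transaction as one commit on the given branch. */
    void commit(String branch, String description) {
        commits(branch).add(description);
    }

    /** Fast-forward merge: publishes every commit the target is missing in one step. */
    void fastForward(String from, String into) {
        List<String> source = commits(from);
        List<String> target = commits(into);
        // A fast-forward is only possible if the target history is a prefix of the source
        boolean targetIsPrefix = target.size() <= source.size()
            && source.subList(0, target.size()).equals(target);
        if (!targetIsPrefix) {
            throw new IllegalStateException("Branches have diverged; not a fast-forward");
        }
        List<String> missing = new ArrayList<>(source.subList(target.size(), source.size()));
        target.addAll(missing);
    }

    /** Read-only view of a branch's commit history, oldest first. */
    List<String> history(String branch) {
        return Collections.unmodifiableList(commits(branch));
    }

    private List<String> commits(String branch) {
        List<String> list = branches.get(branch);
        if (list == null) {
            throw new IllegalArgumentException("No such branch: " + branch);
        }
        return list;
    }

    public static void main(String[] args) {
        ToyRepo repo = new ToyRepo();
        repo.createBranch("etl", "main");
        repo.commit("etl", "update table a");
        repo.commit("etl", "update table b");
        repo.fastForward("etl", "main");
        // main now holds two separate commits, not one combined commit
        System.out.println(repo.history("main"));
    }
}
```

In this model main stays empty until `fastForward` runs, after which it holds two separate commits rather than one combined commit.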
Changes to a table can be tested in a branch before merging back into main. This is particularly useful when performing large changes like schema evolution or partition evolution. A partition evolution could be performed in a branch, where you can test the change (e.g. with performance benchmarks) before merging it. This allows online table modifications and testing without interrupting downstream use cases. If the changes are incorrect or not performant, the branch can be dropped without being merged.
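The test-then-merge-or-drop workflow can be reduced to a small sketch. It assumes, purely for illustration, that a table's schema is a list of column names. `BranchTrial` and `tryChange` are hypothetical names and do not correspond to any Iceberg or Nessie API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Illustrative only: apply a change to a private copy and keep it only if a check passes. */
final class BranchTrial {

    private BranchTrial() {}

    /**
     * @param mainSchema the state visible to downstream users; never modified here
     * @param change     the modification to try on the branch copy
     * @param check      validation to run before "merging" (e.g. a benchmark or sanity test)
     * @return the changed copy if the check passes, otherwise the original main state
     */
    static List<String> tryChange(List<String> mainSchema,
                                  UnaryOperator<List<String>> change,
                                  Predicate<List<String>> check) {
        // "Create a branch": work on a copy so main is untouched while testing
        List<String> branchCopy = change.apply(new ArrayList<>(mainSchema));
        // "Merge" on success, otherwise "drop the branch" and keep main as it was
        return check.test(branchCopy) ? branchCopy : mainSchema;
    }

    public static void main(String[] args) {
        List<String> main = Arrays.asList("id", "ts");
        List<String> merged = tryChange(
            main,
            schema -> { schema.add("region"); return schema; },
            schema -> schema.contains("id"));
        System.out.println(merged); // prints "[id, ts, region]"
    }
}
```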
Further use cases
Please see the Nessie Documentation for further descriptions of Nessie features.
Please see the Nessie Iceberg Demo for a complete example of Nessie and Iceberg in action together.
- Nessie SQL extensions to manage the Nessie repo from Spark SQL
- Iceberg multi-table transactions: changes to multiple Iceberg tables in the same transaction, isolation levels, etc.