Getting Started🔗

The latest version of Iceberg is 1.9.1.

Spark is currently the most feature-rich compute engine for Iceberg operations. We recommend you to get started with Spark to understand Iceberg concepts and features with examples. You can also view documentations of using Iceberg with other compute engine under the Multi-Engine Support page.

Using Iceberg in Spark 3🔗

To use Iceberg in a Spark shell, use the --packages option:

spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.1

Info

If you want to include Iceberg in your Spark installation, add the iceberg-spark-runtime-3.5_2.12 Jar to Spark's jars folder.

Adding catalogs🔗

Iceberg comes with catalogs that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under spark.sql.catalog.(catalog_name).

This command creates a path-based catalog named local for tables under $PWD/warehouse and adds support for Iceberg tables to Spark's built-in catalog:

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.1\
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=hadoop \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse

Creating a table🔗

To create your first Iceberg table in Spark, use the spark-sql shell or spark.sql(...) to run a CREATE TABLE command:

-- local is the path-based catalog defined above
CREATE TABLE local.db.table (id bigint, data string) USING iceberg;

Iceberg catalogs support the full range of SQL DDL commands, including:

Writing🔗

Once your table is created, insert data using INSERT INTO:

INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO local.db.table SELECT id, data FROM source WHERE length(data) = 1;

Iceberg also adds row-level SQL updates to Spark, MERGE INTO and DELETE FROM:

MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count
WHEN NOT MATCHED THEN INSERT *;

Iceberg supports writing DataFrames using the new v2 DataFrame write API:

spark.table("source").select("id", "data")
     .writeTo("local.db.table").append()

The old write API is supported, but not recommended.

Reading🔗

To read with SQL, use the Iceberg table's name in a SELECT query:

SELECT count(1) as count, data
FROM local.db.table
GROUP BY data;

SQL is also the recommended way to inspect tables. To view all snapshots in a table, use the snapshots metadata table:

SELECT * FROM local.db.table.snapshots;

+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+
| committed_at            | snapshot_id    | parent_id | operation | manifest_list                                      | ... |
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+
| 2019-02-08 03:29:51.215 | 57897183625154 | null      | append    | s3://.../table/metadata/snap-57897183625154-1.avro | ... |
|                         |                |           |           |                                                    | ... |
|                         |                |           |           |                                                    | ... |
| ...                     | ...            | ...       | ...       | ...                                                | ... |
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+

DataFrame reads are supported and can now reference tables by name using spark.table:

val df = spark.table("local.db.table")
df.count()

Type compatibility🔗

Spark and Iceberg support different set of types. Iceberg does the type conversion automatically, but not for all combinations, so you may want to understand the type conversion in Iceberg in prior to design the types of columns in your tables.

Spark type to Iceberg type🔗

This type conversion table describes how Spark types are converted to the Iceberg types. The conversion applies on both creating Iceberg table and writing to Iceberg table via Spark.

Spark	Iceberg	Notes
boolean	boolean
short	integer
byte	integer
integer	integer
long	long
float	float
double	double
date	date
timestamp	timestamp with timezone
timestamp_ntz	timestamp without timezone
char	string
varchar	string
string	string
binary	binary
decimal	decimal
struct	struct
array	list
map	map

Info

The table is based on representing conversion during creating table. In fact, broader supports are applied on write. Here're some points on write:

Iceberg numeric types (integer, long, float, double, decimal) support promotion during writes. e.g. You can write Spark types short, byte, integer, long to Iceberg type long.
You can write to Iceberg fixed type using Spark binary type. Note that assertion on the length will be performed.

Iceberg type to Spark type🔗

This type conversion table describes how Iceberg types are converted to the Spark types. The conversion applies on reading from Iceberg table via Spark.

Iceberg	Spark	Note
boolean	boolean
integer	integer
long	long
float	float
double	double
date	date
time		Not supported
timestamp with timezone	timestamp
timestamp without timezone	timestamp_ntz
string	string
uuid	string
fixed	binary
binary	binary
decimal	decimal
struct	struct
list	array
map	map

Next steps🔗

Next, you can learn more about Iceberg tables in Spark:

DDL commands: CREATE, ALTER, and DROP
Querying data: SELECT queries and metadata tables
Writing data: INSERT INTO and MERGE INTO
Maintaining tables with stored procedures