Iceberg Java API #
The main purpose of the Iceberg API is to manage table metadata, like schema, partition spec, metadata, and data files that store table data.
Table metadata and operations are accessed through the
Table interface. This interface will return table information.
Table metadata #
Table interface provides access to the table metadata:
schemareturns the current table schema
specreturns the current table partition spec
propertiesreturns a map of key-value properties
currentSnapshotreturns the current table snapshot
snapshotsreturns all valid snapshots for the table
snapshot(id)returns a specific snapshot by ID
locationreturns the table’s base location
Tables also provide
refresh to update the table to the latest version, and expose helpers:
FileIOused to read and write table files
LocationProviderused to create paths for data and metadata files
File level #
Iceberg table scans start by creating a
TableScan object with
TableScan scan = table.newScan();
To configure a scan, call
select on the
TableScan to get a new
TableScan with those changes.
TableScan filteredScan = scan.filter(Expressions.equal("id", 5))
Calls to configuration methods create a new
TableScan so that each
TableScan is immutable and won’t change unexpectedly if shared across threads.
When a scan is configured,
schema are used to return files, tasks, and the read projection.
TableScan scan = table.newScan() .filter(Expressions.equal("id", 5)) .select("id", "data"); Schema projection = scan.schema(); Iterable<CombinedScanTask> tasks = scan.planTasks();
useSnapshot to configure the table snapshot for time travel queries.
Row level #
Iceberg table scans start by creating a
ScanBuilder object with
ScanBuilder scanBuilder = IcebergGenerics.read(table)
To configure a scan, call
select on the
ScanBuilder to get a new
ScanBuilder with those changes.
When a scan is configured, call method
build to execute scan.
CloseableIterable<Record> result = IcebergGenerics.read(table) .where(Expressions.lessThan("id", 5)) .build();
Record is Iceberg record for iceberg-data module
Update operations #
Table also exposes operations that update the table. These operations use a builder pattern,
PendingUpdate, that commits when
PendingUpdate#commit is called.
For example, updating the table schema is done by calling
updateSchema, adding updates to the builder, and finally calling
commit to commit the pending changes to the table:
table.updateSchema() .addColumn("count", Types.LongType.get()) .commit();
Available operations to update a table are:
updateSchema– update the table schema
updateProperties– update table properties
updateLocation– update the table’s base location
newAppend– used to append data files
newFastAppend– used to append data files, will not compact metadata
newOverwrite– used to append data files and remove files that are overwritten
newDelete– used to delete data files
newRewrite– used to rewrite data files; will replace existing files with new versions
newTransaction– create a new table-level transaction
rewriteManifests– rewrite manifest data by clustering files, for faster scan planning
rollback– rollback the table state to a specific snapshot
Transactions are used to commit multiple table changes in a single atomic operation. A transaction is used to create individual operations using factory methods, like
newAppend, just like working with a
Table. Operations created by a transaction are committed as a group when
commitTransaction is called.
For example, deleting and appending a file in the same transaction:
Transaction t = table.newTransaction(); // commit operations to the transaction t.newDelete().deleteFromRowFilter(filter).commit(); t.newAppend().appendFile(data).commit(); // commit all the changes to the table t.commitTransaction();
Iceberg data types are located in the
Primitive type instances are available from static methods in each type class. Types without parameters use
get, and types like
decimal use factory methods:
Types.IntegerType.get() // int Types.DoubleType.get() // double Types.DecimalType.of(9, 2) // decimal(9, 2)
Nested types #
Structs, maps, and lists are created using factory methods in type classes.
Like struct fields, map keys or values and list elements are tracked as nested fields. Nested fields track field IDs and nullability.
Struct fields are created using
NestedField.required. Map value and list element nullability is set in the map and list factory methods.
// struct<1 id: int, 2 data: optional string> StructType struct = Struct.of( Types.NestedField.required(1, "id", Types.IntegerType.get()), Types.NestedField.optional(2, "data", Types.StringType.get()) )
// map<1 key: int, 2 value: optional string> MapType map = MapType.ofOptional( 1, 2, Types.IntegerType.get(), Types.StringType.get() )
// array<1 element: int> ListType list = ListType.ofRequired(1, IntegerType.get());
Iceberg’s expressions are used to configure table scans. To create expressions, use the factory methods in
Supported predicate expressions are:
Supported expression operations are:
Constant expressions are:
Expression binding #
When created, expressions are unbound. Before an expression is used, it will be bound to a data type to find the field ID the expression name represents, and to convert predicate literals.
For example, before using the expression
lessThan("x", 10), Iceberg needs to determine which column
"x" refers to and convert
10 to that column’s data type.
If the expression could be bound to the type
struct<1 x: long, 2 y: long> or to
struct<11 x: int, 12 y: int>.
Expression example #
table.newScan() .filter(Expressions.greaterThanOrEqual("x", 5)) .filter(Expressions.lessThan("x", 10))
Iceberg table support is organized in library modules:
iceberg-commoncontains utility classes used in other modules
iceberg-apicontains the public Iceberg API, including expressions, types, tables, and operations
iceberg-arrowis an implementation of the Iceberg type system for reading and writing data stored in Iceberg tables using Apache Arrow as the in-memory data format
iceberg-awscontains implementations of the Iceberg API to be used with tables stored on AWS S3 and/or for tables defined using the AWS Glue data catalog
iceberg-corecontains implementations of the Iceberg API and support for Avro data files, this is what processing engines should depend on
iceberg-parquetis an optional module for working with tables backed by Parquet files
iceberg-orcis an optional module for working with tables backed by ORC files (experimental)
iceberg-hive-metastoreis an implementation of Iceberg tables backed by the Hive metastore Thrift client
This project Iceberg also has modules for adding Iceberg support to processing engines and associated tooling:
iceberg-spark2is an implementation of Spark’s Datasource V2 API in 2.4 for Iceberg (use iceberg-spark-runtime for a shaded version)
iceberg-spark3is an implementation of Spark’s Datasource V2 API in 3.0 for Iceberg (use iceberg-spark3-runtime for a shaded version)
iceberg-spark-3.1is an implementation of Spark’s Datasource V2 API in 3.1 for Iceberg (use iceberg-spark-runtime-3.1 for a shaded version)
iceberg-spark-3.2is an implementation of Spark’s Datasource V2 API in 3.2 for Iceberg (use iceberg-spark-runtime-3.2 for a shaded version)
iceberg-flinkis an implementation of Flink’s Table and DataStream API for Iceberg (use iceberg-flink-runtime for a shaded version)
iceberg-hive3is an implementation of Hive 3 specific SerDe’s for Timestamp, TimestampWithZone, and Date object inspectors (use iceberg-hive-runtime for a shaded version).
iceberg-mris an implementation of MapReduce and Hive InputFormats and SerDes for Iceberg (use iceberg-hive-runtime for a shaded version for use with Hive)
iceberg-nessieis a module used to integrate Iceberg table metadata history and operations with Project Nessie
iceberg-datais a client library used to read Iceberg tables from JVM applications
iceberg-pigis an implementation of Pig’s LoadFunc API for Iceberg
iceberg-runtimegenerates a shaded runtime jar for Spark to integrate with iceberg tables