Hive

Hive read support

Iceberg supports reading Iceberg tables from Hive by using a StorageHandler. Please note that only Hive 2.x versions are currently supported.

Table creation

This section explains the steps needed to overlay a Hive table “on top of” an existing Iceberg table. Iceberg tables are created using either a Catalog or an implementation of the Tables interface, and Hive needs to be configured accordingly to read data from these different types of tables.

Add the Iceberg Hive Runtime jar file to the Hive classpath

Regardless of the table type, the HiveIcebergStorageHandler and supporting classes need to be made available on Hive’s classpath. These are provided by the iceberg-hive-runtime jar file. For example, if using the Hive shell, this can be achieved by issuing a statement like so:

add jar /path/to/iceberg-hive-runtime.jar;

There are many other ways to achieve this, including adding the jar file to Hive’s auxiliary classpath (so it is available by default); please refer to Hive’s documentation for more information.
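For example, a minimal hive-site.xml entry for the auxiliary classpath approach might look like the following (the jar path is a placeholder to adapt to your environment):

<property>
  <name>hive.aux.jars.path</name>
  <value>/path/to/iceberg-hive-runtime.jar</value>
</property>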

Using Hadoop Tables

Iceberg tables created using HadoopTables are stored entirely in a directory on a filesystem such as HDFS.

Create an Iceberg table

The first step is to create an Iceberg table using the Spark/Java/Python API and HadoopTables. For the purposes of this documentation we will assume that the table is called table_a and that the table location is hdfs://some_path/table_a.
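For illustration, a minimal Java sketch of this step might look like the following; the two-column schema shown here is a hypothetical example:

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

// Hypothetical two-column schema for table_a
Schema schema = new Schema(
    Types.NestedField.required(1, "id", Types.LongType.get()),
    Types.NestedField.optional(2, "data", Types.StringType.get()));

// HadoopTables keeps all table state (data and metadata) under the given location
HadoopTables tables = new HadoopTables(new Configuration());
Table table = tables.create(schema, PartitionSpec.unpartitioned(),
    "hdfs://some_path/table_a");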

Create a Hive table

Now overlay a Hive table on top of this Iceberg table by issuing Hive DDL like so:

CREATE EXTERNAL TABLE table_a
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://some_path/table_a';

Query the Iceberg table via Hive

You should now be able to issue Hive SQL SELECT queries against the above table and see the results returned from the underlying Iceberg table. Both the MapReduce and Tez query execution engines are supported.

SELECT * FROM table_a;

Using Hive Catalog

Iceberg tables created using HiveCatalog are automatically registered with Hive.

Create an Iceberg table

The first step is to create an Iceberg table using the Spark/Java/Python API and HiveCatalog. For the purposes of this documentation we will assume that the table is called table_b and that the table location is s3://some_path/table_b. In order for Iceberg to correctly set up the Hive table for querying, some configuration values need to be set. The two options for this are described below; you can use one or the other depending on your use case.

Hive Configuration

The value iceberg.engine.hive.enabled needs to be set to true and added to the Hive configuration file on the classpath of the application creating the table. This can be done by modifying the relevant hive-site.xml. Alternatively, this can be done programmatically like so:

Configuration hadoopConfiguration = spark.sparkContext().hadoopConfiguration();
hadoopConfiguration.set(ConfigProperties.ENGINE_HIVE_ENABLED, "true"); //iceberg.engine.hive.enabled=true
HiveCatalog catalog = new HiveCatalog(hadoopConfiguration);
...
catalog.createTable(tableId, schema, spec);

Table Property Configuration

The property engine.hive.enabled needs to be set to true and added to the table properties when creating the Iceberg table. This can be done like so:

Map<String, String> tableProperties = new HashMap<String, String>();
tableProperties.put(TableProperties.ENGINE_HIVE_ENABLED, "true"); //engine.hive.enabled=true
catalog.createTable(tableId, schema, spec, tableProperties);

Query the Iceberg table via Hive

In order to query a Hive table created by either of the HiveCatalog methods described above, you first need to set a Hive configuration value like so:

SET iceberg.mr.catalog=hive;

You should now be able to issue Hive SQL SELECT queries against the above table and see the results returned from the underlying Iceberg table. Both the MapReduce and Tez query execution engines are supported.

SELECT * FROM table_b;

Features

Predicate pushdown

Pushdown of the Hive SQL WHERE clause has been implemented so that these filters are used at the Iceberg TableScan level as well as by the Parquet and ORC readers.
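For example, in a query such as the one below (the id column is illustrative), the WHERE predicate is applied at the Iceberg TableScan level and again by the file format readers, rather than only being evaluated after rows are returned:

SELECT * FROM table_a WHERE id > 100;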