Package org.apache.iceberg.spark
Class SparkTableUtil
- java.lang.Object
  - org.apache.iceberg.spark.SparkTableUtil
public class SparkTableUtil extends java.lang.Object
Java version of the original SparkTableUtil.scala https://github.com/apache/iceberg/blob/apache-iceberg-0.8.0-incubating/spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
-
Nested Class Summary
static class SparkTableUtil.SparkPartition
Class representing a table partition.
-
Method Summary
static java.util.List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, java.lang.String table)
Returns all partitions in the table.

static java.util.List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent)
Returns all partitions in the table.

static java.util.List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, java.lang.String table, java.lang.String predicate)
Returns partitions that match the specified 'predicate'.

static java.util.List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, org.apache.spark.sql.catalyst.expressions.Expression predicateExpr)
Returns partitions that match the specified 'predicate'.

static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, java.util.List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, java.lang.String stagingDir)
Import files from given partitions to an Iceberg table.

static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, java.lang.String stagingDir)
Import files from an existing Spark table to an Iceberg table.

static java.util.List<DataFile> listPartition(java.util.Map<java.lang.String,java.lang.String> partition, java.lang.String uri, java.lang.String format, PartitionSpec spec, org.apache.hadoop.conf.Configuration conf, MetricsConfig metricsConfig)
Returns the data files in a partition by listing the partition location.

static java.util.List<DataFile> listPartition(SparkTableUtil.SparkPartition partition, PartitionSpec spec, SerializableConfiguration conf, MetricsConfig metricsConfig)
Returns the data files in a partition by listing the partition location.

static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDF(org.apache.spark.sql.SparkSession spark, java.lang.String table)
Returns a DataFrame with a row for each partition in the table.

static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDFByFilter(org.apache.spark.sql.SparkSession spark, java.lang.String table, java.lang.String expression)
Returns a DataFrame with a row for each partition that matches the specified 'expression'.
-
Method Detail
-
partitionDF
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDF(org.apache.spark.sql.SparkSession spark, java.lang.String table)
Returns a DataFrame with a row for each partition in the table. The DataFrame has three columns: partition key (e.g., "a=1/b=2"), partition location, and format (avro or parquet).
Parameters:
spark - a Spark session
table - a table name and (optional) database
Returns:
a DataFrame of the table's partitions
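For illustration (not part of the original Javadoc), a minimal sketch of listing the partitions of a hypothetical Hive table "db.logs"; the table name, app name, and Hive-enabled session are all assumptions:

    import org.apache.iceberg.spark.SparkTableUtil;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PartitionDFExample {
      public static void main(String[] args) {
        // Hive support is needed so the session catalog can resolve "db.logs".
        SparkSession spark = SparkSession.builder()
            .appName("partition-df-example")
            .enableHiveSupport()
            .getOrCreate();

        // One row per partition: partition key ("a=1/b=2"), location, format.
        Dataset<Row> partitions = SparkTableUtil.partitionDF(spark, "db.logs");
        partitions.show(false);
      }
    }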
-
partitionDFByFilter
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDFByFilter(org.apache.spark.sql.SparkSession spark, java.lang.String table, java.lang.String expression)
Returns a DataFrame with a row for each partition that matches the specified 'expression'.
Parameters:
spark - a Spark session
table - name of the table
expression - the expression whose matching partitions are returned
Returns:
a DataFrame of the table partitions
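A minimal sketch of the filtered variant, assuming the hypothetical table "db.logs" is partitioned by a day column:

    import org.apache.iceberg.spark.SparkTableUtil;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PartitionDFByFilterExample {
      // "db.logs" and its `day` partition column are illustrative assumptions.
      static Dataset<Row> recentPartitions(SparkSession spark) {
        return SparkTableUtil.partitionDFByFilter(spark, "db.logs", "day >= '2020-01-01'");
      }
    }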
-
getPartitions
public static java.util.List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, java.lang.String table)
Returns all partitions in the table.
Parameters:
spark - a Spark session
table - a table name and (optional) database
Returns:
all of the table's partitions
-
getPartitions
public static java.util.List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent)
Returns all partitions in the table.
Parameters:
spark - a Spark session
tableIdent - a table identifier
Returns:
all of the table's partitions
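A sketch covering both overloads above; the names "db.logs", "db", and "logs" are illustrative assumptions:

    import java.util.List;
    import org.apache.iceberg.spark.SparkTableUtil;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.catalyst.TableIdentifier;
    import scala.Option;

    public class GetPartitionsExample {
      // Both overloads resolve the same partition metadata from the catalog.
      static List<SparkTableUtil.SparkPartition> byName(SparkSession spark) {
        return SparkTableUtil.getPartitions(spark, "db.logs");
      }

      static List<SparkTableUtil.SparkPartition> byIdentifier(SparkSession spark) {
        TableIdentifier ident = TableIdentifier.apply("logs", Option.apply("db"));
        return SparkTableUtil.getPartitions(spark, ident);
      }
    }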
-
getPartitionsByFilter
public static java.util.List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, java.lang.String table, java.lang.String predicate)
Returns partitions that match the specified 'predicate'.
Parameters:
spark - a Spark session
table - a table name and (optional) database
predicate - a predicate on partition columns
Returns:
the table's matching partitions
-
getPartitionsByFilter
public static java.util.List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, org.apache.spark.sql.catalyst.expressions.Expression predicateExpr)
Returns partitions that match the specified 'predicate'.
Parameters:
spark - a Spark session
tableIdent - a table identifier
predicateExpr - a predicate expression on partition columns
Returns:
the table's matching partitions
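A sketch of both filter overloads. The table and column names are assumptions, and building a catalyst Expression through the session's SQL parser is just one possible approach from Java (it is an unstable internal Spark API):

    import java.util.List;
    import org.apache.iceberg.spark.SparkTableUtil;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.catalyst.TableIdentifier;
    import org.apache.spark.sql.catalyst.expressions.Expression;
    import scala.Option;

    public class GetPartitionsByFilterExample {
      // String form: Spark parses the predicate for you.
      static List<SparkTableUtil.SparkPartition> byString(SparkSession spark) {
        return SparkTableUtil.getPartitionsByFilter(spark, "db.logs", "day = '2020-01-01'");
      }

      // Expression form: parseExpression may throw a checked ParseException,
      // hence the throws clause.
      static List<SparkTableUtil.SparkPartition> byExpression(SparkSession spark) throws Exception {
        Expression expr = spark.sessionState().sqlParser().parseExpression("day = '2020-01-01'");
        TableIdentifier ident = TableIdentifier.apply("logs", Option.apply("db"));
        return SparkTableUtil.getPartitionsByFilter(spark, ident, expr);
      }
    }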
-
listPartition
public static java.util.List<DataFile> listPartition(SparkTableUtil.SparkPartition partition, PartitionSpec spec, SerializableConfiguration conf, MetricsConfig metricsConfig)
Returns the data files in a partition by listing the partition location. For Parquet and ORC partitions, this will read metrics from the file footer. For Avro partitions, metrics are set to null.
Parameters:
partition - a partition
spec - a partition spec
conf - a serializable Hadoop conf
metricsConfig - a metrics conf
Returns:
a List of DataFile
-
listPartition
public static java.util.List<DataFile> listPartition(java.util.Map<java.lang.String,java.lang.String> partition, java.lang.String uri, java.lang.String format, PartitionSpec spec, org.apache.hadoop.conf.Configuration conf, MetricsConfig metricsConfig)
Returns the data files in a partition by listing the partition location. For Parquet and ORC partitions, this will read metrics from the file footer. For Avro partitions, metrics are set to null.
Parameters:
partition - partition key, e.g., "a=1/b=2"
uri - partition location URI
format - partition format, avro or parquet
spec - a partition spec
conf - a Hadoop conf
metricsConfig - a metrics conf
Returns:
a List of DataFile
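A hedged sketch of listing one partition directly; the schema, field IDs, partition location, and table layout are all illustrative assumptions:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.MetricsConfig;
    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.spark.SparkTableUtil;
    import org.apache.iceberg.types.Types;

    public class ListPartitionExample {
      static List<DataFile> listOneDay() {
        // Minimal schema and spec for a table partitioned by "day".
        Schema schema = new Schema(
            Types.NestedField.required(1, "day", Types.StringType.get()),
            Types.NestedField.required(2, "value", Types.LongType.get()));
        PartitionSpec spec = PartitionSpec.builderFor(schema).identity("day").build();

        Map<String, String> partitionKey = Collections.singletonMap("day", "2020-01-01");
        return SparkTableUtil.listPartition(
            partitionKey,
            "hdfs://namenode:8020/warehouse/db/logs/day=2020-01-01",  // hypothetical URI
            "parquet",
            spec,
            new Configuration(),
            MetricsConfig.getDefault());
      }
    }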
-
importSparkTable
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, java.lang.String stagingDir)
Import files from an existing Spark table to an Iceberg table. The import uses the Spark session to get table metadata. It assumes no concurrent operations are running against the source or target table, and thus is not thread-safe.
Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
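A sketch of a full-table import. Loading the target through HadoopTables is one option among several catalogs; the warehouse and staging paths are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.hadoop.HadoopTables;
    import org.apache.iceberg.spark.SparkTableUtil;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.catalyst.TableIdentifier;
    import scala.Option;

    public class ImportSparkTableExample {
      static void importAll(SparkSession spark) {
        // Load the target Iceberg table from a path-based location.
        Table target = new HadoopTables(new Configuration())
            .load("hdfs://namenode:8020/warehouse/iceberg/logs");

        TableIdentifier source = TableIdentifier.apply("logs", Option.apply("db"));
        SparkTableUtil.importSparkTable(spark, source, target, "/tmp/iceberg-staging");
      }
    }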
-
importSparkPartitions
public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, java.util.List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, java.lang.String stagingDir)
Import files from the given partitions to an Iceberg table.
Parameters:
spark - a Spark session
partitions - the partitions to import
targetTable - the Iceberg table to import the data into
spec - a partition spec
stagingDir - a staging directory to store temporary manifest files
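A sketch that imports a single partition selected with getPartitionsByFilter and registers its files under the target table's own spec; all names and paths are hypothetical:

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.hadoop.HadoopTables;
    import org.apache.iceberg.spark.SparkTableUtil;
    import org.apache.spark.sql.SparkSession;

    public class ImportSparkPartitionsExample {
      static void importOneDay(SparkSession spark) {
        Table target = new HadoopTables(new Configuration())
            .load("hdfs://namenode:8020/warehouse/iceberg/logs");  // hypothetical location

        // Select the partitions to import, then register their data files
        // against the target table's partition spec.
        List<SparkTableUtil.SparkPartition> partitions =
            SparkTableUtil.getPartitionsByFilter(spark, "db.logs", "day = '2020-01-01'");
        SparkTableUtil.importSparkPartitions(
            spark, partitions, target, target.spec(), "/tmp/iceberg-staging");
      }
    }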