public class SparkTableUtil
extends java.lang.Object
Modifier and Type | Class and Description |
---|---|
static class | SparkTableUtil.SparkPartition: Class representing a table partition. |
Modifier and Type | Method and Description |
---|---|
static java.util.List<SparkTableUtil.SparkPartition> | getPartitions(org.apache.spark.sql.SparkSession spark, java.lang.String table): Returns all partitions in the table. |
static java.util.List<SparkTableUtil.SparkPartition> | getPartitions(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent): Returns all partitions in the table. |
static java.util.List<SparkTableUtil.SparkPartition> | getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, java.lang.String table, java.lang.String predicate): Returns partitions that match the specified 'predicate'. |
static java.util.List<SparkTableUtil.SparkPartition> | getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, org.apache.spark.sql.catalyst.expressions.Expression predicateExpr): Returns partitions that match the specified 'predicate'. |
static void | importSparkPartitions(org.apache.spark.sql.SparkSession spark, java.util.List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, java.lang.String stagingDir): Import files from given partitions to an Iceberg table. |
static void | importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, java.lang.String stagingDir): Import files from an existing Spark table to an Iceberg table. |
static java.util.List<DataFile> | listPartition(java.util.Map<java.lang.String,java.lang.String> partition, java.lang.String uri, java.lang.String format, PartitionSpec spec, org.apache.hadoop.conf.Configuration conf, MetricsConfig metricsConfig): Returns the data files in a partition by listing the partition location. |
static java.util.List<DataFile> | listPartition(java.util.Map<java.lang.String,java.lang.String> partition, java.lang.String uri, java.lang.String format, PartitionSpec spec, org.apache.hadoop.conf.Configuration conf, MetricsConfig metricsConfig, NameMapping mapping): Returns the data files in a partition by listing the partition location. |
static java.util.List<DataFile> | listPartition(SparkTableUtil.SparkPartition partition, PartitionSpec spec, SerializableConfiguration conf, MetricsConfig metricsConfig): Returns the data files in a partition by listing the partition location. |
static java.util.List<DataFile> | listPartition(SparkTableUtil.SparkPartition partition, PartitionSpec spec, SerializableConfiguration conf, MetricsConfig metricsConfig, NameMapping mapping): Returns the data files in a partition by listing the partition location. |
static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | partitionDF(org.apache.spark.sql.SparkSession spark, java.lang.String table): Returns a DataFrame with a row for each partition in the table. |
static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | partitionDFByFilter(org.apache.spark.sql.SparkSession spark, java.lang.String table, java.lang.String expression): Returns a DataFrame with a row for each partition that matches the specified 'expression'. |
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDF(org.apache.spark.sql.SparkSession spark, java.lang.String table)

Returns a DataFrame with a row for each partition in the table.

Parameters:
spark - a Spark session
table - a table name and (optional) database
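A minimal sketch of calling partitionDF, assuming a Hive-enabled session; the source table name db.sample is a hypothetical stand-in:

```java
import org.apache.iceberg.spark.SparkTableUtil;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionDFExample {
  public static void main(String[] args) {
    // Hive support lets Spark resolve the source table's partitions.
    SparkSession spark = SparkSession.builder()
        .appName("partition-df-example")
        .enableHiveSupport()
        .getOrCreate();

    // One row per partition of the hypothetical table "db.sample".
    Dataset<Row> partitions = SparkTableUtil.partitionDF(spark, "db.sample");
    partitions.show(false);
  }
}
```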
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDFByFilter(org.apache.spark.sql.SparkSession spark, java.lang.String table, java.lang.String expression)

Returns a DataFrame with a row for each partition that matches the specified 'expression'.

Parameters:
spark - a Spark session
table - name of the table
expression - the expression whose matching partitions are returned
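The filter variant takes a Spark SQL expression string. A sketch, where the ds partition column is a hypothetical assumption:

```java
import org.apache.iceberg.spark.SparkTableUtil;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionDFByFilterExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("partition-df-by-filter-example")
        .enableHiveSupport()
        .getOrCreate();

    // Keep only partitions of "db.sample" whose "ds" value is in range.
    Dataset<Row> matched = SparkTableUtil.partitionDFByFilter(
        spark, "db.sample", "ds >= '2021-01-01'");
    matched.show(false);
  }
}
```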
public static java.util.List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, java.lang.String table)

Returns all partitions in the table.

Parameters:
spark - a Spark session
table - a table name and (optional) database
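For programmatic access, a sketch of the list-returning variant, again against the hypothetical table db.sample:

```java
import java.util.List;
import org.apache.iceberg.spark.SparkTableUtil;
import org.apache.spark.sql.SparkSession;

public class GetPartitionsExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("get-partitions-example")
        .enableHiveSupport()
        .getOrCreate();

    // Each SparkPartition carries the partition values, location, and format.
    List<SparkTableUtil.SparkPartition> partitions =
        SparkTableUtil.getPartitions(spark, "db.sample");
    partitions.forEach(System.out::println);
  }
}
```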
public static java.util.List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent)

Returns all partitions in the table.

Parameters:
spark - a Spark session
tableIdent - a table identifier
public static java.util.List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, java.lang.String table, java.lang.String predicate)

Returns partitions that match the specified 'predicate'.

Parameters:
spark - a Spark session
table - a table name and (optional) database
predicate - a predicate on partition columns
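A sketch of filtering partitions with a predicate string; db.sample and the ds column are hypothetical:

```java
import java.util.List;
import org.apache.iceberg.spark.SparkTableUtil;
import org.apache.spark.sql.SparkSession;

public class GetPartitionsByFilterExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("get-partitions-by-filter-example")
        .enableHiveSupport()
        .getOrCreate();

    // Only partitions whose "ds" value satisfies the predicate are returned.
    List<SparkTableUtil.SparkPartition> matched =
        SparkTableUtil.getPartitionsByFilter(spark, "db.sample", "ds = '2021-01-01'");
    System.out.println(matched.size() + " matching partitions");
  }
}
```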
public static java.util.List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, org.apache.spark.sql.catalyst.expressions.Expression predicateExpr)

Returns partitions that match the specified 'predicate'.

Parameters:
spark - a Spark session
tableIdent - a table identifier
predicateExpr - a predicate expression on partition columns
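This overload takes a catalyst Expression. One convenient way to build one from Java, shown here as an assumption rather than the only route, is to parse a SQL string with org.apache.spark.sql.functions.expr and unwrap the Column:

```java
import java.util.List;
import org.apache.iceberg.spark.SparkTableUtil;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.TableIdentifier;
import org.apache.spark.sql.catalyst.expressions.Expression;
import org.apache.spark.sql.functions;

public class GetPartitionsByExpressionExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("get-partitions-by-expression-example")
        .enableHiveSupport()
        .getOrCreate();

    // TableIdentifier is a Scala case class: table name plus optional database.
    TableIdentifier ident = new TableIdentifier("sample", scala.Option.apply("db"));
    // Parse a SQL predicate and unwrap the underlying catalyst Expression.
    Expression predicateExpr = functions.expr("ds = '2021-01-01'").expr();

    List<SparkTableUtil.SparkPartition> matched =
        SparkTableUtil.getPartitionsByFilter(spark, ident, predicateExpr);
    System.out.println(matched.size() + " matching partitions");
  }
}
```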
public static java.util.List<DataFile> listPartition(SparkTableUtil.SparkPartition partition, PartitionSpec spec, SerializableConfiguration conf, MetricsConfig metricsConfig)

Returns the data files in a partition by listing the partition location.

Parameters:
partition - a partition
spec - a partition spec
conf - a serializable Hadoop conf
metricsConfig - a metrics conf
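A sketch of listing one partition's data files. The schema, spec, and table name are hypothetical stand-ins, and Spark's SerializableConfiguration wrapper is assumed:

```java
import java.util.List;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.MetricsConfig;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.spark.SparkTableUtil;
import org.apache.iceberg.types.Types;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.SerializableConfiguration;

public class ListPartitionExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("list-partition-example")
        .enableHiveSupport()
        .getOrCreate();

    // Hypothetical schema partitioned by an identity "ds" column.
    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.required(2, "ds", Types.StringType.get()));
    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("ds").build();

    // Take the first partition of the hypothetical table "db.sample".
    SparkTableUtil.SparkPartition partition =
        SparkTableUtil.getPartitions(spark, "db.sample").get(0);

    // Wrap the Hadoop conf so it can ship to executors.
    List<DataFile> files = SparkTableUtil.listPartition(
        partition,
        spec,
        new SerializableConfiguration(spark.sessionState().newHadoopConf()),
        MetricsConfig.getDefault());
    files.forEach(f -> System.out.println(f.path()));
  }
}
```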
public static java.util.List<DataFile> listPartition(SparkTableUtil.SparkPartition partition, PartitionSpec spec, SerializableConfiguration conf, MetricsConfig metricsConfig, NameMapping mapping)

Returns the data files in a partition by listing the partition location.

Parameters:
partition - a partition
spec - a partition spec
conf - a serializable Hadoop conf
metricsConfig - a metrics conf
mapping - a name mapping
public static java.util.List<DataFile> listPartition(java.util.Map<java.lang.String,java.lang.String> partition, java.lang.String uri, java.lang.String format, PartitionSpec spec, org.apache.hadoop.conf.Configuration conf, MetricsConfig metricsConfig)

Returns the data files in a partition by listing the partition location.

Parameters:
partition - partition key, e.g., "a=1/b=2"
uri - partition location URI
format - partition format, avro or parquet
spec - a partition spec
conf - a Hadoop conf
metricsConfig - a metrics conf
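The map-based overload does not need a live Spark catalog; it lists a location directly. All names and paths below are hypothetical:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.MetricsConfig;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.spark.SparkTableUtil;
import org.apache.iceberg.types.Types;

public class ListPartitionByLocationExample {
  public static void main(String[] args) {
    // Hypothetical schema and spec matching the on-disk partition layout.
    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.required(2, "ds", Types.StringType.get()));
    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("ds").build();

    // Partition key as column-name-to-value pairs, mirroring "ds=2021-01-01".
    Map<String, String> partitionKey = Collections.singletonMap("ds", "2021-01-01");

    List<DataFile> files = SparkTableUtil.listPartition(
        partitionKey,
        "hdfs://nn:8020/warehouse/db/sample/ds=2021-01-01",  // hypothetical location
        "parquet",                                           // "avro" or "parquet"
        spec,
        new Configuration(),
        MetricsConfig.getDefault());
    System.out.println(files.size() + " data files");
  }
}
```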
public static java.util.List<DataFile> listPartition(java.util.Map<java.lang.String,java.lang.String> partition, java.lang.String uri, java.lang.String format, PartitionSpec spec, org.apache.hadoop.conf.Configuration conf, MetricsConfig metricsConfig, NameMapping mapping)

Returns the data files in a partition by listing the partition location. For Parquet and ORC partitions, this will read metrics from the file footer. For Avro partitions, metrics are set to null.

Note: certain metrics, like NaN counts, that are supported by Iceberg file writers but not file footers, will not be populated.

Parameters:
partition - partition key, e.g., "a=1/b=2"
uri - partition location URI
format - partition format, avro or parquet
spec - a partition spec
conf - a Hadoop conf
metricsConfig - a metrics conf
mapping - a name mapping
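The same call with a NameMapping. Deriving the mapping from the schema via MappingUtil.create is one option assumed here; it lets files written without Iceberg field IDs be resolved by column name:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.MetricsConfig;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.mapping.MappingUtil;
import org.apache.iceberg.mapping.NameMapping;
import org.apache.iceberg.spark.SparkTableUtil;
import org.apache.iceberg.types.Types;

public class ListPartitionWithMappingExample {
  public static void main(String[] args) {
    // Same hypothetical schema and layout as the previous sketch.
    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.required(2, "ds", Types.StringType.get()));
    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("ds").build();
    Map<String, String> partitionKey = Collections.singletonMap("ds", "2021-01-01");

    // Map columns by name for files that lack Iceberg field IDs.
    NameMapping mapping = MappingUtil.create(schema);

    List<DataFile> files = SparkTableUtil.listPartition(
        partitionKey,
        "hdfs://nn:8020/warehouse/db/sample/ds=2021-01-01",  // hypothetical location
        "parquet",
        spec,
        new Configuration(),
        MetricsConfig.getDefault(),
        mapping);
    System.out.println(files.size() + " data files");
  }
}
```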
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, java.lang.String stagingDir)

Import files from an existing Spark table to an Iceberg table.

Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
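A sketch of a full-table import. Loading the target through HadoopTables is an assumption here; any catalog that returns a Table works, and all paths and names are hypothetical:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.spark.SparkTableUtil;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.TableIdentifier;

public class ImportSparkTableExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("import-spark-table-example")
        .enableHiveSupport()
        .getOrCreate();

    // Load the target Iceberg table; a path-based HadoopTables load is assumed.
    Table target = new HadoopTables(spark.sessionState().newHadoopConf())
        .load("hdfs://nn:8020/warehouse/iceberg/sample");

    // Import every data file of the source Spark table "db.sample".
    SparkTableUtil.importSparkTable(
        spark,
        new TableIdentifier("sample", scala.Option.apply("db")),
        target,
        "hdfs://nn:8020/tmp/iceberg-staging");
  }
}
```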
public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, java.util.List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, java.lang.String stagingDir)

Import files from given partitions to an Iceberg table.

Parameters:
spark - a Spark session
partitions - partitions to import
targetTable - the Iceberg table to import the data into
spec - a partition spec
stagingDir - a staging directory to store temporary manifest files
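importSparkPartitions pairs naturally with getPartitionsByFilter for selective imports; the same hypothetical names and paths apply:

```java
import java.util.List;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.spark.SparkTableUtil;
import org.apache.spark.sql.SparkSession;

public class ImportSparkPartitionsExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("import-spark-partitions-example")
        .enableHiveSupport()
        .getOrCreate();

    Table target = new HadoopTables(spark.sessionState().newHadoopConf())
        .load("hdfs://nn:8020/warehouse/iceberg/sample");

    // Select a subset of the source table's partitions, then import only those.
    List<SparkTableUtil.SparkPartition> partitions =
        SparkTableUtil.getPartitionsByFilter(spark, "db.sample", "ds = '2021-01-01'");

    // Reuse the target table's own partition spec for the import.
    SparkTableUtil.importSparkPartitions(
        spark, partitions, target, target.spec(), "hdfs://nn:8020/tmp/iceberg-staging");
  }
}
```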