Class SparkTableUtil

java.lang.Object
org.apache.iceberg.spark.SparkTableUtil

public class SparkTableUtil extends Object
Java version of the original SparkTableUtil.scala: https://github.com/apache/iceberg/blob/apache-iceberg-0.8.0-incubating/spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
  • Method Details

    • partitionDF

      public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDF(org.apache.spark.sql.SparkSession spark, String table)
      Returns a DataFrame with a row for each partition in the table.

      The DataFrame has three columns: the partition key (e.g. a=1/b=2), the partition location, and the file format (avro or parquet).

      Parameters:
      spark - a Spark session
      table - a table name, optionally qualified by a database
      Returns:
      a DataFrame of the table's partitions
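
      For example, a minimal sketch (the table name db.source_table is hypothetical):

        import org.apache.iceberg.spark.SparkTableUtil;
        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Row;
        import org.apache.spark.sql.SparkSession;

        SparkSession spark = SparkSession.builder().getOrCreate();

        // List a Hive/Spark table's partitions; the resulting DataFrame has the
        // three columns described above (partition key, location, format)
        Dataset<Row> partitions = SparkTableUtil.partitionDF(spark, "db.source_table");
        partitions.show(false);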
    • partitionDFByFilter

      public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDFByFilter(org.apache.spark.sql.SparkSession spark, String table, String expression)
      Returns a DataFrame with a row for each partition that matches the specified 'expression'.
      Parameters:
      spark - a Spark session
      table - a table name, optionally qualified by a database
      expression - the expression used to select matching partitions
      Returns:
      a DataFrame of the matching partitions
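
      A minimal sketch, reusing the imports and session from the partitionDF example (the table and the dt partition column are hypothetical):

        // Only partitions whose values satisfy the expression are returned
        Dataset<Row> matching =
            SparkTableUtil.partitionDFByFilter(spark, "db.source_table", "dt = '2023-01-01'");
        matching.show(false);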
    • getPartitions

      public static List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, String table)
      Returns all partitions in the table.
      Parameters:
      spark - a Spark session
      table - a table name, optionally qualified by a database
      Returns:
      all of the table's partitions
    • getPartitions

      public static List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, Map<String,String> partitionFilter)
      Returns all partitions in the table.
      Parameters:
      spark - a Spark session
      tableIdent - a table identifier
      partitionFilter - a partition filter map, or null for no filtering
      Returns:
      all of the table's partitions
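
      A minimal sketch of both overloads, assuming the usual java.util imports and the session from the partitionDF example (table, database, and the dt column are hypothetical):

        // String-based lookup of every partition
        List<SparkTableUtil.SparkPartition> all =
            SparkTableUtil.getPartitions(spark, "db.source_table");

        // TableIdentifier-based lookup with a partial partition filter
        org.apache.spark.sql.catalyst.TableIdentifier ident =
            new org.apache.spark.sql.catalyst.TableIdentifier("source_table", scala.Option.apply("db"));
        List<SparkTableUtil.SparkPartition> filtered =
            SparkTableUtil.getPartitions(spark, ident, java.util.Collections.singletonMap("dt", "2023-01-01"));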
    • getPartitionsByFilter

      public static List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, String table, String predicate)
      Returns partitions that match the specified 'predicate'.
      Parameters:
      spark - a Spark session
      table - a table name, optionally qualified by a database
      predicate - a predicate on partition columns
      Returns:
      the table's partitions that match the predicate
    • getPartitionsByFilter

      public static List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, org.apache.spark.sql.catalyst.expressions.Expression predicateExpr)
      Returns partitions that match the specified 'predicate'.
      Parameters:
      spark - a Spark session
      tableIdent - a table identifier
      predicateExpr - a predicate expression on partition columns
      Returns:
      the table's partitions that match the predicate expression
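
      A minimal sketch using the string-predicate overload (the table and column names are hypothetical):

        // Each returned SparkPartition carries its partition values, location URI, and format
        List<SparkTableUtil.SparkPartition> parts =
            SparkTableUtil.getPartitionsByFilter(spark, "db.source_table", "dt >= '2023-01-01'");
        for (SparkTableUtil.SparkPartition p : parts) {
          System.out.println(p.getValues() + " -> " + p.getUri());
        }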
    • importSparkTable

      public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String,String> partitionFilter, boolean checkDuplicateFiles)
      Import files from an existing Spark table to an Iceberg table.

      The import uses the Spark session to read table metadata. It assumes that no other operation is modifying the source or target table concurrently, and is therefore not thread-safe.

      Parameters:
      spark - a Spark session
      sourceTableIdent - an identifier of the source Spark table
      targetTable - the Iceberg table to import the data into
      stagingDir - a staging directory to store temporary manifest files
      partitionFilter - only import partitions whose values match those in the map, can be partially defined
      checkDuplicateFiles - if true, throw exception if import results in a duplicate data file
    • importSparkTable

      public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, int parallelism)
      Import files from an existing Spark table to an Iceberg table.

      The import uses the Spark session to read table metadata. It assumes that no other operation is modifying the source or target table concurrently, and is therefore not thread-safe.

      Parameters:
      spark - a Spark session
      sourceTableIdent - an identifier of the source Spark table
      targetTable - the Iceberg table to import the data into
      stagingDir - a staging directory to store temporary manifest files
      parallelism - number of threads to use for file reading
    • importSparkTable

      public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, ExecutorService service)
      Import files from an existing Spark table to an Iceberg table.

      The import uses the Spark session to read table metadata. It assumes that no other operation is modifying the source or target table concurrently, and is therefore not thread-safe.

      Parameters:
      spark - a Spark session
      sourceTableIdent - an identifier of the source Spark table
      targetTable - the Iceberg table to import the data into
      stagingDir - a staging directory to store temporary manifest files
      service - executor service to use for file reading
    • importSparkTable

      public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String,String> partitionFilter, boolean checkDuplicateFiles, int parallelism)
      Import files from an existing Spark table to an Iceberg table.

      The import uses the Spark session to read table metadata. It assumes that no other operation is modifying the source or target table concurrently, and is therefore not thread-safe.

      Parameters:
      spark - a Spark session
      sourceTableIdent - an identifier of the source Spark table
      targetTable - the Iceberg table to import the data into
      stagingDir - a staging directory to store temporary manifest files
      partitionFilter - only import partitions whose values match those in the map, can be partially defined
      checkDuplicateFiles - if true, throw exception if import results in a duplicate data file
      parallelism - number of threads to use for file reading
    • importSparkTable

      public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String,String> partitionFilter, boolean checkDuplicateFiles, ExecutorService service)
      Import files from an existing Spark table to an Iceberg table.

      The import uses the Spark session to read table metadata. It assumes that no other operation is modifying the source or target table concurrently, and is therefore not thread-safe.

      Parameters:
      spark - a Spark session
      sourceTableIdent - an identifier of the source Spark table
      targetTable - the Iceberg table to import the data into
      stagingDir - a staging directory to store temporary manifest files
      partitionFilter - only import partitions whose values match those in the map, can be partially defined
      checkDuplicateFiles - if true, throw exception if import results in a duplicate data file
      service - executor service to use for file reading
    • importSparkTable

      public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, boolean checkDuplicateFiles)
      Import files from an existing Spark table to an Iceberg table.

      The import uses the Spark session to read table metadata. It assumes that no other operation is modifying the source or target table concurrently, and is therefore not thread-safe.

      Parameters:
      spark - a Spark session
      sourceTableIdent - an identifier of the source Spark table
      targetTable - the Iceberg table to import the data into
      stagingDir - a staging directory to store temporary manifest files
      checkDuplicateFiles - if true, throw exception if import results in a duplicate data file
    • importSparkTable

      public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir)
      Import files from an existing Spark table to an Iceberg table.

      The import uses the Spark session to read table metadata. It assumes that no other operation is modifying the source or target table concurrently, and is therefore not thread-safe.

      Parameters:
      spark - a Spark session
      sourceTableIdent - an identifier of the source Spark table
      targetTable - the Iceberg table to import the data into
      stagingDir - a staging directory to store temporary manifest files
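
      A minimal end-to-end sketch of this simplest overload; the identifiers, warehouse path, and staging directory are hypothetical, and the target table is loaded via HadoopTables purely for illustration:

        import org.apache.iceberg.Table;
        import org.apache.iceberg.hadoop.HadoopTables;
        import org.apache.iceberg.spark.SparkTableUtil;
        import org.apache.spark.sql.SparkSession;
        import org.apache.spark.sql.catalyst.TableIdentifier;

        SparkSession spark = SparkSession.builder().getOrCreate();

        // Source: an existing (non-Iceberg) Spark table registered in the session catalog
        TableIdentifier source = new TableIdentifier("source_table", scala.Option.apply("db"));

        // Target: an Iceberg table that must already exist with a compatible schema and spec
        Table target = new HadoopTables(spark.sessionState().newHadoopConf())
            .load("hdfs://nn:8020/warehouse/db/target_table");

        // Append the source table's data files to the Iceberg table without rewriting them
        SparkTableUtil.importSparkTable(spark, source, target, "/tmp/iceberg-staging");

      The overloads above layer on a partition filter, duplicate-file checking, and a parallelism setting or ExecutorService for file reading; the variant shown here imports every partition with the defaults.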
    • importSparkPartitions

      public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles)
      Import files from given partitions to an Iceberg table.
      Parameters:
      spark - a Spark session
      partitions - partitions to import
      targetTable - the Iceberg table to import the data into
      spec - a partition spec
      stagingDir - a staging directory to store temporary manifest files
      checkDuplicateFiles - if true, throw exception if import results in a duplicate data file
    • importSparkPartitions

      public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles, int parallelism)
      Import files from given partitions to an Iceberg table.
      Parameters:
      spark - a Spark session
      partitions - partitions to import
      targetTable - the Iceberg table to import the data into
      spec - a partition spec
      stagingDir - a staging directory to store temporary manifest files
      checkDuplicateFiles - if true, throw exception if import results in a duplicate data file
      parallelism - number of threads to use for file reading
    • importSparkPartitions

      public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles, ExecutorService service)
      Import files from given partitions to an Iceberg table.
      Parameters:
      spark - a Spark session
      partitions - partitions to import
      targetTable - the Iceberg table to import the data into
      spec - a partition spec
      stagingDir - a staging directory to store temporary manifest files
      checkDuplicateFiles - if true, throw exception if import results in a duplicate data file
      service - executor service to use for file reading
    • importSparkPartitions

      public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir)
      Import files from given partitions to an Iceberg table.
      Parameters:
      spark - a Spark session
      partitions - partitions to import
      targetTable - the Iceberg table to import the data into
      spec - a partition spec
      stagingDir - a staging directory to store temporary manifest files
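
      A minimal sketch that imports one partition selected by filter (names and paths are hypothetical; target is an already-loaded Iceberg Table, as in the importSparkTable example above):

        // Select the partitions to import, then register their files with the Iceberg table
        List<SparkTableUtil.SparkPartition> parts =
            SparkTableUtil.getPartitionsByFilter(spark, "db.source_table", "dt = '2023-01-01'");
        SparkTableUtil.importSparkPartitions(spark, parts, target, target.spec(), "/tmp/iceberg-staging");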
    • filterPartitions

      public static List<SparkTableUtil.SparkPartition> filterPartitions(List<SparkTableUtil.SparkPartition> partitions, Map<String,String> partitionFilter)
      Returns the subset of the given partitions whose values match the (possibly partial) partitionFilter map; an empty filter map matches all partitions.
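
      A minimal sketch (allPartitions is a list obtained earlier, e.g. from getPartitions; the dt key and value are hypothetical):

        // Keep only partitions whose values contain dt=2023-01-01
        List<SparkTableUtil.SparkPartition> matched = SparkTableUtil.filterPartitions(
            allPartitions, java.util.Collections.singletonMap("dt", "2023-01-01"));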
    • loadMetadataTable

      public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(org.apache.spark.sql.SparkSession spark, Table table, MetadataTableType type)
      Returns a DataFrame over the given metadata table of an Iceberg table (for example, MetadataTableType.FILES).
    • loadMetadataTable

      public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(org.apache.spark.sql.SparkSession spark, Table table, MetadataTableType type, Map<String,String> extraOptions)
      Same as the overload above, with extraOptions passed through as read options for the metadata table scan.
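
      A minimal sketch over the FILES metadata table (target is an already-loaded Iceberg Table, as in the importSparkTable example):

        // Inspect the table's data files through the FILES metadata table
        Dataset<Row> files = SparkTableUtil.loadMetadataTable(spark, target, MetadataTableType.FILES);
        files.select("file_path", "record_count").show(false);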
    • determineWriteBranch

      public static String determineWriteBranch(org.apache.spark.sql.SparkSession spark, String branch)
      Determines the branch for a write operation.

      Validates the WAP (write-audit-publish) configuration and returns the session's WAP branch if one is set, otherwise the given branch.

      Parameters:
      spark - a Spark session
      branch - the branch to write to when no WAP branch is configured
      Returns:
      the branch for the write operation
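
      A minimal sketch; what is returned depends on the session's WAP configuration:

        // Returns the session's WAP branch if one is configured, otherwise "main"
        String branch = SparkTableUtil.determineWriteBranch(spark, "main");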
    • wapEnabled

      public static boolean wapEnabled(Table table)
      Returns whether write-audit-publish (WAP) is enabled for the table, based on the write.wap.enabled table property.
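
      A minimal sketch (target is an already-loaded Iceberg Table):

        // True when the table's write.wap.enabled property is set to true
        boolean wap = SparkTableUtil.wapEnabled(target);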