Class SparkTableUtil


  • public class SparkTableUtil
    extends java.lang.Object
    Java version of the original SparkTableUtil.scala: https://github.com/apache/iceberg/blob/apache-iceberg-0.8.0-incubating/spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  SparkTableUtil.SparkPartition
      Class representing a table partition.
    • Method Summary

      Modifier and Type Method Description
      static java.lang.String determineWriteBranch​(org.apache.spark.sql.SparkSession spark, java.lang.String branch)
      Determine the write branch.
      static java.util.List<SparkTableUtil.SparkPartition> filterPartitions​(java.util.List<SparkTableUtil.SparkPartition> partitions, java.util.Map<java.lang.String,​java.lang.String> partitionFilter)  
      static java.util.List<SparkTableUtil.SparkPartition> getPartitions​(org.apache.spark.sql.SparkSession spark, java.lang.String table)
      Returns all partitions in the table.
      static java.util.List<SparkTableUtil.SparkPartition> getPartitions​(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, java.util.Map<java.lang.String,​java.lang.String> partitionFilter)
      Returns all partitions in the table.
      static java.util.List<SparkTableUtil.SparkPartition> getPartitionsByFilter​(org.apache.spark.sql.SparkSession spark, java.lang.String table, java.lang.String predicate)
      Returns partitions that match the specified 'predicate'.
      static java.util.List<SparkTableUtil.SparkPartition> getPartitionsByFilter​(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, org.apache.spark.sql.catalyst.expressions.Expression predicateExpr)
      Returns partitions that match the specified 'predicate'.
      static void importSparkPartitions​(org.apache.spark.sql.SparkSession spark, java.util.List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, java.lang.String stagingDir)
      Import files from given partitions to an Iceberg table.
      static void importSparkPartitions​(org.apache.spark.sql.SparkSession spark, java.util.List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, java.lang.String stagingDir, boolean checkDuplicateFiles)
      Import files from given partitions to an Iceberg table.
      static void importSparkPartitions​(org.apache.spark.sql.SparkSession spark, java.util.List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, java.lang.String stagingDir, boolean checkDuplicateFiles, int parallelism)
      Import files from given partitions to an Iceberg table.
      static void importSparkPartitions​(org.apache.spark.sql.SparkSession spark, java.util.List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, java.lang.String stagingDir, boolean checkDuplicateFiles, java.util.concurrent.ExecutorService service)
      Import files from given partitions to an Iceberg table.
      static void importSparkTable​(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, java.lang.String stagingDir)
      Import files from an existing Spark table to an Iceberg table.
      static void importSparkTable​(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, java.lang.String stagingDir, boolean checkDuplicateFiles)
      Import files from an existing Spark table to an Iceberg table.
      static void importSparkTable​(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, java.lang.String stagingDir, int parallelism)
      Import files from an existing Spark table to an Iceberg table.
      static void importSparkTable​(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, java.lang.String stagingDir, java.util.concurrent.ExecutorService service)
      Import files from an existing Spark table to an Iceberg table.
      static void importSparkTable​(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, java.lang.String stagingDir, java.util.Map<java.lang.String,​java.lang.String> partitionFilter, boolean checkDuplicateFiles)
      Import files from an existing Spark table to an Iceberg table.
      static void importSparkTable​(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, java.lang.String stagingDir, java.util.Map<java.lang.String,​java.lang.String> partitionFilter, boolean checkDuplicateFiles, int parallelism)
      Import files from an existing Spark table to an Iceberg table.
      static void importSparkTable​(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, java.lang.String stagingDir, java.util.Map<java.lang.String,​java.lang.String> partitionFilter, boolean checkDuplicateFiles, java.util.concurrent.ExecutorService service)
      Import files from an existing Spark table to an Iceberg table.
      static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable​(org.apache.spark.sql.SparkSession spark, Table table, MetadataTableType type)  
      static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable​(org.apache.spark.sql.SparkSession spark, Table table, MetadataTableType type, java.util.Map<java.lang.String,​java.lang.String> extraOptions)  
      static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadTable​(org.apache.spark.sql.SparkSession spark, Table table, long snapshotId)  
      static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDF​(org.apache.spark.sql.SparkSession spark, java.lang.String table)
      Returns a DataFrame with a row for each partition in the table.
      static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDFByFilter​(org.apache.spark.sql.SparkSession spark, java.lang.String table, java.lang.String expression)
      Returns a DataFrame with a row for each partition that matches the specified 'expression'.
      static boolean wapEnabled​(Table table)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • partitionDF

        public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDF​(org.apache.spark.sql.SparkSession spark,
                                                                                         java.lang.String table)
        Returns a DataFrame with a row for each partition in the table.

        The DataFrame has 3 columns: partition key (e.g. a=1/b=2), partition location, and format (avro or parquet).

        Parameters:
        spark - a Spark session
        table - a table name and (optional) database
        Returns:
        a DataFrame of the table's partitions
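
        For illustration, a minimal sketch of calling this method from a Java driver; the table name db.logs is a placeholder:

          import org.apache.iceberg.spark.SparkTableUtil;
          import org.apache.spark.sql.Dataset;
          import org.apache.spark.sql.Row;
          import org.apache.spark.sql.SparkSession;

          SparkSession spark = SparkSession.builder().getOrCreate();
          // One row per partition: key (e.g. a=1/b=2), location, and file format.
          Dataset<Row> partitionsDF = SparkTableUtil.partitionDF(spark, "db.logs");
          partitionsDF.show(false);
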
      • partitionDFByFilter

        public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDFByFilter​(org.apache.spark.sql.SparkSession spark,
                                                                                                 java.lang.String table,
                                                                                                 java.lang.String expression)
        Returns a DataFrame with a row for each partition that matches the specified 'expression'.
        Parameters:
        spark - a Spark session
        table - a table name and (optional) database
        expression - the expression used to select matching partitions
        Returns:
        a DataFrame of the table's matching partitions
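
        A sketch reusing the spark session from the previous example; the partition column ds is hypothetical:

          Dataset<Row> recent = SparkTableUtil.partitionDFByFilter(
              spark, "db.logs", "ds >= '2024-01-01'");
          recent.show(false);
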
      • getPartitions

        public static java.util.List<SparkTableUtil.SparkPartition> getPartitions​(org.apache.spark.sql.SparkSession spark,
                                                                                  java.lang.String table)
        Returns all partitions in the table.
        Parameters:
        spark - a Spark session
        table - a table name and (optional) database
        Returns:
        all of the table's partitions
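
        A sketch of listing partitions programmatically, assuming SparkPartition's getValues/getUri/getFormat accessors (same placeholder table as above):

          import java.util.List;
          import org.apache.iceberg.spark.SparkTableUtil.SparkPartition;

          List<SparkPartition> parts = SparkTableUtil.getPartitions(spark, "db.logs");
          for (SparkPartition p : parts) {
            // Print each partition's values, location, and file format.
            System.out.println(p.getValues() + " -> " + p.getUri() + " (" + p.getFormat() + ")");
          }
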
      • getPartitions

        public static java.util.List<SparkTableUtil.SparkPartition> getPartitions​(org.apache.spark.sql.SparkSession spark,
                                                                                  org.apache.spark.sql.catalyst.TableIdentifier tableIdent,
                                                                                  java.util.Map<java.lang.String,​java.lang.String> partitionFilter)
        Returns all partitions in the table.
        Parameters:
        spark - a Spark session
        tableIdent - a table identifier
        partitionFilter - partition filter, or null if no filter
        Returns:
        all of the table's partitions
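
        A sketch using a TableIdentifier and a partition filter map; the db.logs table and the ds column are placeholders:

          import java.util.Collections;
          import org.apache.spark.sql.catalyst.TableIdentifier;

          TableIdentifier ident = new TableIdentifier("logs", scala.Option.apply("db"));
          List<SparkPartition> dayParts = SparkTableUtil.getPartitions(
              spark, ident, Collections.singletonMap("ds", "2024-01-01"));
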
      • getPartitionsByFilter

        public static java.util.List<SparkTableUtil.SparkPartition> getPartitionsByFilter​(org.apache.spark.sql.SparkSession spark,
                                                                                          java.lang.String table,
                                                                                          java.lang.String predicate)
        Returns partitions that match the specified 'predicate'.
        Parameters:
        spark - a Spark session
        table - a table name and (optional) database
        predicate - a predicate on partition columns
        Returns:
        the table's partitions that match the predicate
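
        The predicate-string form of the same lookup, again with a hypothetical ds column:

          List<SparkPartition> matched = SparkTableUtil.getPartitionsByFilter(
              spark, "db.logs", "ds = '2024-01-01'");
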
      • getPartitionsByFilter

        public static java.util.List<SparkTableUtil.SparkPartition> getPartitionsByFilter​(org.apache.spark.sql.SparkSession spark,
                                                                                          org.apache.spark.sql.catalyst.TableIdentifier tableIdent,
                                                                                          org.apache.spark.sql.catalyst.expressions.Expression predicateExpr)
        Returns partitions that match the specified 'predicate'.
        Parameters:
        spark - a Spark session
        tableIdent - a table identifier
        predicateExpr - a predicate expression on partition columns
        Returns:
        the table's partitions that match the predicate expression
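
        One way to obtain a catalyst Expression from Java is to parse a SQL predicate string with the session's parser; this sketch assumes SparkSession.sessionState(), which is public but marked unstable:

          import org.apache.spark.sql.catalyst.expressions.Expression;

          // parseExpression may throw ParseException.
          Expression predicate =
              spark.sessionState().sqlParser().parseExpression("ds = '2024-01-01'");
          List<SparkPartition> matchedByExpr =
              SparkTableUtil.getPartitionsByFilter(spark, ident, predicate);
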
      • importSparkTable

        public static void importSparkTable​(org.apache.spark.sql.SparkSession spark,
                                            org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent,
                                            Table targetTable,
                                            java.lang.String stagingDir,
                                            java.util.Map<java.lang.String,​java.lang.String> partitionFilter,
                                            boolean checkDuplicateFiles)
        Import files from an existing Spark table to an Iceberg table.

        The import uses the Spark session to get table metadata. It assumes no other operation is modifying the source or target table concurrently and is therefore not thread-safe.

        Parameters:
        spark - a Spark session
        sourceTableIdent - an identifier of the source Spark table
        targetTable - the Iceberg table to import the data into
        stagingDir - a staging directory to store temporary manifest files
        partitionFilter - only import partitions whose values match those in the map; may be partially defined
        checkDuplicateFiles - if true, throw an exception if the import would result in a duplicate data file
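
        A sketch of a filtered import, reusing spark and ident from the sketches above. The staging path, table location, and the HadoopTables lookup are illustrative assumptions; how the target Table is loaded depends on your catalog:

          import java.util.Collections;
          import org.apache.iceberg.Table;
          import org.apache.iceberg.hadoop.HadoopTables;

          Table target = new HadoopTables(spark.sparkContext().hadoopConfiguration())
              .load("hdfs://nn/warehouse/db/logs_iceberg");

          SparkTableUtil.importSparkTable(
              spark, ident, target, "/tmp/iceberg-staging",
              Collections.singletonMap("ds", "2024-01-01"), // import this partition only
              true);                                        // fail on duplicate data files
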
      • importSparkTable

        public static void importSparkTable​(org.apache.spark.sql.SparkSession spark,
                                            org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent,
                                            Table targetTable,
                                            java.lang.String stagingDir,
                                            int parallelism)
        Import files from an existing Spark table to an Iceberg table.

        The import uses the Spark session to get table metadata. It assumes no other operation is modifying the source or target table concurrently and is therefore not thread-safe.

        Parameters:
        spark - a Spark session
        sourceTableIdent - an identifier of the source Spark table
        targetTable - the Iceberg table to import the data into
        stagingDir - a staging directory to store temporary manifest files
        parallelism - number of threads to use for file reading
      • importSparkTable

        public static void importSparkTable​(org.apache.spark.sql.SparkSession spark,
                                            org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent,
                                            Table targetTable,
                                            java.lang.String stagingDir,
                                            java.util.concurrent.ExecutorService service)
        Import files from an existing Spark table to an Iceberg table.

        The import uses the Spark session to get table metadata. It assumes no other operation is modifying the source or target table concurrently and is therefore not thread-safe.

        Parameters:
        spark - a Spark session
        sourceTableIdent - an identifier of the source Spark table
        targetTable - the Iceberg table to import the data into
        stagingDir - a staging directory to store temporary manifest files
        service - executor service to use for file reading. If null, file reading is performed on the current thread. If non-null, the provided ExecutorService is shut down within this method after file reading is complete.
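
        A sketch of the executor-service overload, reusing spark, ident, and target from above. Because the method shuts the pool down, a dedicated pool is created:

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;

          ExecutorService pool = Executors.newFixedThreadPool(8);
          // The pool is shut down by the method after file reading completes; do not reuse it.
          SparkTableUtil.importSparkTable(spark, ident, target, "/tmp/iceberg-staging", pool);
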
      • importSparkTable

        public static void importSparkTable​(org.apache.spark.sql.SparkSession spark,
                                            org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent,
                                            Table targetTable,
                                            java.lang.String stagingDir,
                                            java.util.Map<java.lang.String,​java.lang.String> partitionFilter,
                                            boolean checkDuplicateFiles,
                                            int parallelism)
        Import files from an existing Spark table to an Iceberg table.

        The import uses the Spark session to get table metadata. It assumes no other operation is modifying the source or target table concurrently and is therefore not thread-safe.

        Parameters:
        spark - a Spark session
        sourceTableIdent - an identifier of the source Spark table
        targetTable - the Iceberg table to import the data into
        stagingDir - a staging directory to store temporary manifest files
        partitionFilter - only import partitions whose values match those in the map; may be partially defined
        checkDuplicateFiles - if true, throw an exception if the import would result in a duplicate data file
        parallelism - number of threads to use for file reading
      • importSparkTable

        public static void importSparkTable​(org.apache.spark.sql.SparkSession spark,
                                            org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent,
                                            Table targetTable,
                                            java.lang.String stagingDir,
                                            java.util.Map<java.lang.String,​java.lang.String> partitionFilter,
                                            boolean checkDuplicateFiles,
                                            java.util.concurrent.ExecutorService service)
        Import files from an existing Spark table to an Iceberg table.

        The import uses the Spark session to get table metadata. It assumes no other operation is modifying the source or target table concurrently and is therefore not thread-safe.

        Parameters:
        spark - a Spark session
        sourceTableIdent - an identifier of the source Spark table
        targetTable - the Iceberg table to import the data into
        stagingDir - a staging directory to store temporary manifest files
        partitionFilter - only import partitions whose values match those in the map; may be partially defined
        checkDuplicateFiles - if true, throw an exception if the import would result in a duplicate data file
        service - executor service to use for file reading. If null, file reading is performed on the current thread. If non-null, the provided ExecutorService is shut down within this method after file reading is complete.
      • importSparkTable

        public static void importSparkTable​(org.apache.spark.sql.SparkSession spark,
                                            org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent,
                                            Table targetTable,
                                            java.lang.String stagingDir,
                                            boolean checkDuplicateFiles)
        Import files from an existing Spark table to an Iceberg table.

        The import uses the Spark session to get table metadata. It assumes no other operation is modifying the source or target table concurrently and is therefore not thread-safe.

        Parameters:
        spark - a Spark session
        sourceTableIdent - an identifier of the source Spark table
        targetTable - the Iceberg table to import the data into
        stagingDir - a staging directory to store temporary manifest files
        checkDuplicateFiles - if true, throw an exception if the import would result in a duplicate data file
      • importSparkTable

        public static void importSparkTable​(org.apache.spark.sql.SparkSession spark,
                                            org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent,
                                            Table targetTable,
                                            java.lang.String stagingDir)
        Import files from an existing Spark table to an Iceberg table.

        The import uses the Spark session to get table metadata. It assumes no other operation is modifying the source or target table concurrently and is therefore not thread-safe.

        Parameters:
        spark - a Spark session
        sourceTableIdent - an identifier of the source Spark table
        targetTable - the Iceberg table to import the data into
        stagingDir - a staging directory to store temporary manifest files
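
        The simplest overload imports every partition (or the whole table if unpartitioned); a sketch reusing the placeholders from the earlier examples:

          SparkTableUtil.importSparkTable(spark, ident, target, "/tmp/iceberg-staging");
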
      • importSparkPartitions

        public static void importSparkPartitions​(org.apache.spark.sql.SparkSession spark,
                                                 java.util.List<SparkTableUtil.SparkPartition> partitions,
                                                 Table targetTable,
                                                 PartitionSpec spec,
                                                 java.lang.String stagingDir,
                                                 boolean checkDuplicateFiles)
        Import files from given partitions to an Iceberg table.
        Parameters:
        spark - a Spark session
        partitions - partitions to import
        targetTable - the Iceberg table to import the data into
        spec - a partition spec
        stagingDir - a staging directory to store temporary manifest files
        checkDuplicateFiles - if true, throw an exception if the import would result in a duplicate data file
      • importSparkPartitions

        public static void importSparkPartitions​(org.apache.spark.sql.SparkSession spark,
                                                 java.util.List<SparkTableUtil.SparkPartition> partitions,
                                                 Table targetTable,
                                                 PartitionSpec spec,
                                                 java.lang.String stagingDir,
                                                 boolean checkDuplicateFiles,
                                                 int parallelism)
        Import files from given partitions to an Iceberg table.
        Parameters:
        spark - a Spark session
        partitions - partitions to import
        targetTable - the Iceberg table to import the data into
        spec - a partition spec
        stagingDir - a staging directory to store temporary manifest files
        checkDuplicateFiles - if true, throw an exception if the import would result in a duplicate data file
        parallelism - number of threads to use for file reading
      • importSparkPartitions

        public static void importSparkPartitions​(org.apache.spark.sql.SparkSession spark,
                                                 java.util.List<SparkTableUtil.SparkPartition> partitions,
                                                 Table targetTable,
                                                 PartitionSpec spec,
                                                 java.lang.String stagingDir,
                                                 boolean checkDuplicateFiles,
                                                 java.util.concurrent.ExecutorService service)
        Import files from given partitions to an Iceberg table.
        Parameters:
        spark - a Spark session
        partitions - partitions to import
        targetTable - the Iceberg table to import the data into
        spec - a partition spec
        stagingDir - a staging directory to store temporary manifest files
        checkDuplicateFiles - if true, throw an exception if the import would result in a duplicate data file
        service - executor service to use for file reading. If null, file reading is performed on the current thread. If non-null, the provided ExecutorService is shut down within this method after file reading is complete.
      • importSparkPartitions

        public static void importSparkPartitions​(org.apache.spark.sql.SparkSession spark,
                                                 java.util.List<SparkTableUtil.SparkPartition> partitions,
                                                 Table targetTable,
                                                 PartitionSpec spec,
                                                 java.lang.String stagingDir)
        Import files from given partitions to an Iceberg table.
        Parameters:
        spark - a Spark session
        partitions - partitions to import
        targetTable - the Iceberg table to import the data into
        spec - a partition spec
        stagingDir - a staging directory to store temporary manifest files
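
        A sketch combining partition discovery with a partition-level import, reusing spark and target from above; it assumes the target table's current spec matches the layout of the imported partitions:

          import org.apache.iceberg.PartitionSpec;

          List<SparkPartition> selected =
              SparkTableUtil.getPartitionsByFilter(spark, "db.logs", "ds = '2024-01-01'");
          PartitionSpec spec = target.spec(); // the target table's current partition spec
          SparkTableUtil.importSparkPartitions(
              spark, selected, target, spec, "/tmp/iceberg-staging");
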
      • loadTable

        public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadTable​(org.apache.spark.sql.SparkSession spark,
                                                                                       Table table,
                                                                                       long snapshotId)
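
        No description accompanies this method; from the signature it reads the table's contents at a specific snapshot. A sketch using the current snapshot id of the target table from the sketches above:

          long snapshotId = target.currentSnapshot().snapshotId();
          Dataset<Row> rows = SparkTableUtil.loadTable(spark, target, snapshotId);
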
      • loadMetadataTable

        public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable​(org.apache.spark.sql.SparkSession spark,
                                                                                               Table table,
                                                                                               MetadataTableType type)
      • loadMetadataTable

        public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable​(org.apache.spark.sql.SparkSession spark,
                                                                                               Table table,
                                                                                               MetadataTableType type,
                                                                                               java.util.Map<java.lang.String,​java.lang.String> extraOptions)
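
        A sketch of loading a metadata table; MetadataTableType.FILES and the file_path and record_count columns are standard Iceberg metadata-table fields:

          import org.apache.iceberg.MetadataTableType;

          Dataset<Row> files =
              SparkTableUtil.loadMetadataTable(spark, target, MetadataTableType.FILES);
          files.select("file_path", "record_count").show(false);
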
      • determineWriteBranch

        public static java.lang.String determineWriteBranch​(org.apache.spark.sql.SparkSession spark,
                                                            java.lang.String branch)
        Determine the write branch.

        Validates the WAP (write-audit-publish) configuration and determines the write branch.

        Parameters:
        spark - a Spark Session
        branch - the branch to write to when no WAP branch is configured
        Returns:
        branch for write operation
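
        A sketch; per the description above, the supplied branch is used when no WAP branch is configured in the session:

          String writeBranch = SparkTableUtil.determineWriteBranch(spark, "main");
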
      • wapEnabled

        public static boolean wapEnabled​(Table table)
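
        No description accompanies this method; presumably it reflects the table's write-audit-publish setting (Iceberg's write.wap.enabled table property). A sketch, reusing the target table from above:

          boolean wap = SparkTableUtil.wapEnabled(target);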