Class SparkTableUtil
-
Nested Class Summary
Modifier and Type | Class | Description
static class | SparkTableUtil.SparkPartition | Class representing a table partition.
-
Method Summary
Modifier and Type | Method | Description
static String | determineWriteBranch(org.apache.spark.sql.SparkSession spark, String branch) | Determine the write branch.
static List<SparkTableUtil.SparkPartition> | filterPartitions(List<SparkTableUtil.SparkPartition> partitions, Map<String, String> partitionFilter) |
static List<SparkTableUtil.SparkPartition> | getPartitions(org.apache.spark.sql.SparkSession spark, String table) | Returns all partitions in the table.
static List<SparkTableUtil.SparkPartition> | getPartitions(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, Map<String, String> partitionFilter) | Returns all partitions in the table.
static List<SparkTableUtil.SparkPartition> | getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, String table, String predicate) | Returns partitions that match the specified 'predicate'.
static List<SparkTableUtil.SparkPartition> | getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, org.apache.spark.sql.catalyst.expressions.Expression predicateExpr) | Returns partitions that match the specified 'predicate'.
static void | importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir) | Import files from given partitions to an Iceberg table.
static void | importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles) | Import files from given partitions to an Iceberg table.
static void | importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles, int parallelism) | Import files from given partitions to an Iceberg table.
static void | importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles, ExecutorService service) | Import files from given partitions to an Iceberg table.
static void | importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir) | Import files from an existing Spark table to an Iceberg table.
static void | importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, boolean checkDuplicateFiles) | Import files from an existing Spark table to an Iceberg table.
static void | importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, int parallelism) | Import files from an existing Spark table to an Iceberg table.
static void | importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, ExecutorService service) | Import files from an existing Spark table to an Iceberg table.
static void | importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles) | Import files from an existing Spark table to an Iceberg table.
static void | importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles, int parallelism) | Import files from an existing Spark table to an Iceberg table.
static void | importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles, ExecutorService service) | Import files from an existing Spark table to an Iceberg table.
static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | loadMetadataTable(org.apache.spark.sql.SparkSession spark, Table table, MetadataTableType type) |
static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | loadMetadataTable(org.apache.spark.sql.SparkSession spark, Table table, MetadataTableType type, Map<String, String> extraOptions) |
static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | partitionDF(org.apache.spark.sql.SparkSession spark, String table) | Returns a DataFrame with a row for each partition in the table.
static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | partitionDFByFilter(org.apache.spark.sql.SparkSession spark, String table, String expression) | Returns a DataFrame with a row for each partition that matches the specified 'expression'.
static boolean | wapEnabled(Table table) |
-
Method Details
-
partitionDF
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDF(org.apache.spark.sql.SparkSession spark, String table)
Returns a DataFrame with a row for each partition in the table.
The DataFrame has 3 columns: partition key (a=1/b=2), partition location, and format (avro or parquet).
Parameters:
spark - a Spark session
table - a table name and (optional) database
Returns:
a DataFrame of the table's partitions
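The partition key form shown above (a=1/b=2) can be split into column/value pairs with a few lines of plain Java. This is only an illustrative sketch: the class PartitionKeySketch and its parse method are hypothetical and not part of SparkTableUtil, and it assumes the key is available as text in exactly that form.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of SparkTableUtil: splits a partition key
// such as "a=1/b=2" into column/value pairs, keeping the column order.
final class PartitionKeySketch {
  static Map<String, String> parse(String partitionKey) {
    Map<String, String> values = new LinkedHashMap<>();
    for (String part : partitionKey.split("/")) {
      int eq = part.indexOf('=');
      if (eq <= 0) {
        // Reject segments without a column name or without '='.
        throw new IllegalArgumentException("Not a column=value pair: " + part);
      }
      values.put(part.substring(0, eq), part.substring(eq + 1));
    }
    return values;
  }

  public static void main(String[] args) {
    System.out.println(parse("a=1/b=2")); // prints {a=1, b=2}
  }
}
```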
-
partitionDFByFilter
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDFByFilter(org.apache.spark.sql.SparkSession spark, String table, String expression)
Returns a DataFrame with a row for each partition that matches the specified 'expression'.
Parameters:
spark - a Spark session.
table - name of the table.
expression - The expression whose matching partitions are returned.
Returns:
a DataFrame of the table partitions.
-
getPartitions
public static List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, String table)
Returns all partitions in the table.
Parameters:
spark - a Spark session
table - a table name and (optional) database
Returns:
all partitions of the table
-
getPartitions
public static List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, Map<String, String> partitionFilter)
Returns all partitions in the table.
Parameters:
spark - a Spark session
tableIdent - a table identifier
partitionFilter - partition filter, or null if no filter
Returns:
all partitions of the table
-
getPartitionsByFilter
public static List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, String table, String predicate)
Returns partitions that match the specified 'predicate'.
Parameters:
spark - a Spark session
table - a table name and (optional) database
predicate - a predicate on partition columns
Returns:
the table's matching partitions
-
getPartitionsByFilter
public static List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, org.apache.spark.sql.catalyst.expressions.Expression predicateExpr)
Returns partitions that match the specified 'predicate'.
Parameters:
spark - a Spark session
tableIdent - a table identifier
predicateExpr - a predicate expression on partition columns
Returns:
the table's matching partitions
-
importSparkTable
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles)
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is in progress on the source or target table, and is therefore not thread-safe.
Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
partitionFilter - only import partitions whose values match those in the map; the map can be partially defined
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
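To illustrate what a partially defined partitionFilter could mean, here is a minimal sketch in plain Java. It assumes a partition matches when every column named in the filter has the same value in the partition, and that columns missing from the filter are unconstrained. The class PartitionFilterSketch and its methods are hypothetical and only model partition values as maps; they are not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of partial partition filtering; not part of SparkTableUtil.
final class PartitionFilterSketch {
  // Assumption: a partition matches when every filter entry equals the
  // partition's value for that column; other columns are not checked.
  static boolean matches(Map<String, String> partitionValues, Map<String, String> filter) {
    for (Map.Entry<String, String> entry : filter.entrySet()) {
      if (!entry.getValue().equals(partitionValues.get(entry.getKey()))) {
        return false;
      }
    }
    return true;
  }

  // Keeps only the partitions that match the (possibly partial) filter.
  static List<Map<String, String>> filter(
      List<Map<String, String>> partitions, Map<String, String> filter) {
    List<Map<String, String>> kept = new ArrayList<>();
    for (Map<String, String> partition : partitions) {
      if (matches(partition, filter)) {
        kept.add(partition);
      }
    }
    return kept;
  }
}
```

With partitions a=1/b=2, a=1/b=3 and a=2/b=2, a filter containing only a=1 would keep the first two under these assumptions.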
-
importSparkTable
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, int parallelism)
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is in progress on the source or target table, and is therefore not thread-safe.
Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
parallelism - number of threads to use for file reading
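The overloads accept either a thread count (parallelism) or a caller-supplied ExecutorService for file reading. The sketch below shows one plausible relationship between the two styles using only java.util.concurrent: the int variant builds a fixed-size pool and shuts it down, while the ExecutorService variant leaves the pool's lifecycle to the caller. ParallelReadSketch, readAll and the reader function are hypothetical names, and whether the library relates its overloads this way is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch of parallel file reading; not part of SparkTableUtil.
final class ParallelReadSketch {
  // Caller-owned executor: tasks are submitted, results are collected in
  // input order, and the executor is left running for the caller to manage.
  static <T> List<T> readAll(
      List<String> paths, Function<String, T> reader, ExecutorService service) {
    List<Future<T>> pending = new ArrayList<>();
    for (String path : paths) {
      Callable<T> task = () -> reader.apply(path);
      pending.add(service.submit(task));
    }
    List<T> results = new ArrayList<>();
    try {
      for (Future<T> future : pending) {
        results.add(future.get());
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while reading files", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("File read failed", e.getCause());
    }
    return results;
  }

  // Thread-count variant: creates a fixed pool and always shuts it down.
  static <T> List<T> readAll(List<String> paths, Function<String, T> reader, int parallelism) {
    ExecutorService service = Executors.newFixedThreadPool(parallelism);
    try {
      return readAll(paths, reader, service);
    } finally {
      service.shutdown();
    }
  }
}
```

Passing an ExecutorService is useful when several imports should share one bounded pool; passing an int is simpler for a one-off import.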
-
importSparkTable
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, ExecutorService service)
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is in progress on the source or target table, and is therefore not thread-safe.
Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
service - executor service to use for file reading
-
importSparkTable
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles, int parallelism)
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is in progress on the source or target table, and is therefore not thread-safe.
Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
partitionFilter - only import partitions whose values match those in the map; the map can be partially defined
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
parallelism - number of threads to use for file reading
-
importSparkTable
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles, ExecutorService service)
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is in progress on the source or target table, and is therefore not thread-safe.
Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
partitionFilter - only import partitions whose values match those in the map; the map can be partially defined
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
service - executor service to use for file reading
-
importSparkTable
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, boolean checkDuplicateFiles)
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is in progress on the source or target table, and is therefore not thread-safe.
Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
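The checkDuplicateFiles flag is described only as throwing an exception when the import would produce a duplicate data file. A rough sketch of such a check follows. It assumes files are compared by path and that an IllegalStateException is raised; both are assumptions, since the source names neither the comparison key nor the exception type. DuplicateFileCheckSketch is a hypothetical class.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a duplicate data file check; not part of SparkTableUtil.
final class DuplicateFileCheckSketch {
  // Assumption: a file is a duplicate if its path is already present in the
  // target table or appears more than once among the files being imported.
  static void checkNoDuplicates(Set<String> existingPaths, List<String> importedPaths) {
    Set<String> seen = new HashSet<>(existingPaths);
    for (String path : importedPaths) {
      if (!seen.add(path)) {
        throw new IllegalStateException("Duplicate data file: " + path);
      }
    }
  }
}
```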
-
importSparkTable
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir)
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is in progress on the source or target table, and is therefore not thread-safe.
Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
-
importSparkPartitions
public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles)
Import files from given partitions to an Iceberg table.
Parameters:
spark - a Spark session
partitions - partitions to import
targetTable - the Iceberg table to import the data into
spec - a partition spec
stagingDir - a staging directory to store temporary manifest files
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
-
importSparkPartitions
public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles, int parallelism)
Import files from given partitions to an Iceberg table.
Parameters:
spark - a Spark session
partitions - partitions to import
targetTable - the Iceberg table to import the data into
spec - a partition spec
stagingDir - a staging directory to store temporary manifest files
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
parallelism - number of threads to use for file reading
-
importSparkPartitions
public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles, ExecutorService service)
Import files from given partitions to an Iceberg table.
Parameters:
spark - a Spark session
partitions - partitions to import
targetTable - the Iceberg table to import the data into
spec - a partition spec
stagingDir - a staging directory to store temporary manifest files
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
service - executor service to use for file reading
-
importSparkPartitions
public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir)
Import files from given partitions to an Iceberg table.
Parameters:
spark - a Spark session
partitions - partitions to import
targetTable - the Iceberg table to import the data into
spec - a partition spec
stagingDir - a staging directory to store temporary manifest files
-
filterPartitions
public static List<SparkTableUtil.SparkPartition> filterPartitions(List<SparkTableUtil.SparkPartition> partitions, Map<String, String> partitionFilter)
-
loadMetadataTable
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(org.apache.spark.sql.SparkSession spark, Table table, MetadataTableType type)
-
loadMetadataTable
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(org.apache.spark.sql.SparkSession spark, Table table, MetadataTableType type, Map<String, String> extraOptions)
-
determineWriteBranch
public static String determineWriteBranch(org.apache.spark.sql.SparkSession spark, String branch)
Determine the write branch.
Validate WAP config and determine the write branch.
Parameters:
spark - a Spark session
branch - write branch if there is no WAP branch configured
Returns:
branch for write operation
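The description above says only that the WAP config is validated and that the given branch is used when no WAP branch is configured. The following sketch models that decision against a plain map of session settings. Everything specific in it is an assumption: the configuration keys spark.wap.id and spark.wap.branch, the two validation rules, and the exception type are illustrative guesses, and WriteBranchSketch is a hypothetical class rather than the library's code.

```java
import java.util.Map;

// Hypothetical sketch of write-branch selection; not part of SparkTableUtil.
final class WriteBranchSketch {
  // Assumed configuration keys; the source does not name them.
  static final String WAP_ID = "spark.wap.id";
  static final String WAP_BRANCH = "spark.wap.branch";

  static String determineWriteBranch(Map<String, String> conf, String branch) {
    String wapId = conf.get(WAP_ID);
    String wapBranch = conf.get(WAP_BRANCH);

    // Assumed rule 1: a WAP ID and a WAP branch cannot both be set.
    if (wapId != null && wapBranch != null) {
      throw new IllegalArgumentException("Cannot set both WAP ID and WAP branch");
    }

    if (wapBranch != null) {
      // Assumed rule 2: an explicit branch must not conflict with the WAP branch.
      if (branch != null && !branch.equals(wapBranch)) {
        throw new IllegalArgumentException("Branch conflicts with WAP branch");
      }
      return wapBranch;
    }

    // No WAP branch configured: use the branch passed by the caller.
    return branch;
  }
}
```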
-
wapEnabled
public static boolean wapEnabled(Table table)
-