Class SparkTableUtil
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description:
- static class SparkTableUtil.SparkPartition: Class representing a table partition.
-
Method Summary
Modifier and Type, Method, Description:
- static String determineReadBranch(org.apache.spark.sql.SparkSession spark, SparkTable sparkTable, org.apache.spark.sql.util.CaseInsensitiveStringMap options)
- static String determineReadBranch(org.apache.spark.sql.SparkSession spark, Table table, String branch, org.apache.spark.sql.util.CaseInsensitiveStringMap options): Determine the read branch.
- static String determineWriteBranch(org.apache.spark.sql.SparkSession spark, SparkTable sparkTable, org.apache.spark.sql.util.CaseInsensitiveStringMap options)
- static String determineWriteBranch(org.apache.spark.sql.SparkSession spark, Table table, String branch, org.apache.spark.sql.util.CaseInsensitiveStringMap options): Determine the write branch.
- static List<SparkTableUtil.SparkPartition> filterPartitions(List<SparkTableUtil.SparkPartition> partitions, Map<String, String> partitionFilter): Deprecated. Since 1.11.0, will be removed in 1.12.0.
- static PartitionSpec findCompatibleSpec(List<String> partitionNames, Table icebergTable): Returns the first partition spec in an Iceberg table that shares the same names and ordering as the partition columns provided.
- static List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, String table): Deprecated. Since 1.11.0, will be removed in 1.12.0.
- static List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, Map<String, String> partitionFilter): Returns all partitions in the table.
- static List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, String table, String predicate): Deprecated. Since 1.11.0, will be removed in 1.12.0.
- static List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, org.apache.spark.sql.catalyst.expressions.Expression predicateExpr): Deprecated. Since 1.11.0, will be removed in 1.12.0.
- static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir): Deprecated. Since 1.11.0, will be removed in 1.12.0.
- static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles): Deprecated. Since 1.11.0, will be removed in 1.12.0.
- static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles, boolean ignoreMissingFiles, ExecutorService service): Import files from given partitions to an Iceberg table.
- static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles, int parallelism): Import files from given partitions to an Iceberg table.
- static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles, ExecutorService service): Import files from given partitions to an Iceberg table.
- static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir): Import files from an existing Spark table to an Iceberg table.
- static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, boolean checkDuplicateFiles): Deprecated. Since 1.11.0, will be removed in 1.12.0.
- static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, int parallelism): Deprecated. Since 1.11.0, will be removed in 1.12.0.
- static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, ExecutorService service): Import files from an existing Spark table to an Iceberg table.
- static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles): Deprecated. Since 1.11.0, will be removed in 1.12.0.
- static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles, boolean ignoreMissingFiles, ExecutorService service): Import files from an existing Spark table to an Iceberg table.
- static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles, int parallelism): Import files from an existing Spark table to an Iceberg table.
- static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles, ExecutorService service): Import files from an existing Spark table to an Iceberg table.
- static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(org.apache.spark.sql.SparkSession spark, Table table, MetadataTableType type)
- static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(org.apache.spark.sql.SparkSession spark, Table table, MetadataTableType type, Map<String, String> extraOptions)
- static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadTable(org.apache.spark.sql.SparkSession spark, Table table, long snapshotId)
- static ExecutorService migrationService(int parallelism)
- static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDF(org.apache.spark.sql.SparkSession spark, String table): Deprecated. Since 1.11.0, will be removed in 1.12.0.
- static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDFByFilter(org.apache.spark.sql.SparkSession spark, String table, String expression): Deprecated. Since 1.11.0, will be removed in 1.12.0.
- static void validatePartitionFilter(PartitionSpec spec, Map<String, String> partitionFilter, String tableName)
- static void validateReadBranch(org.apache.spark.sql.SparkSession spark, Table table, String branch, org.apache.spark.sql.util.CaseInsensitiveStringMap options)
- static void validateWriteBranch(org.apache.spark.sql.SparkSession spark, Table table, String branch, org.apache.spark.sql.util.CaseInsensitiveStringMap options)
-
Method Details
-
partitionDF
@Deprecated public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDF(org.apache.spark.sql.SparkSession spark, String table)
Deprecated. Since 1.11.0, will be removed in 1.12.0.
Returns a DataFrame with a row for each partition in the table. The DataFrame has 3 columns: partition key (a=1/b=2), partition location, and format (avro or parquet).
- Parameters:
spark - a Spark session
table - a table name and (optional) database
- Returns:
a DataFrame of the table's partitions
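The partition key column uses the a=1/b=2 layout shown above. As a minimal sketch, such a key could be split into ordered column/value pairs as follows. The class PartitionKeySketch and its parse method are hypothetical and not part of SparkTableUtil, and the sketch assumes that names and values contain no escaped '/' or '=' characters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of SparkTableUtil.
final class PartitionKeySketch {

  // Splits a key such as "a=1/b=2" into column/value pairs, keeping column order.
  // Assumption: no escaping is applied to names or values.
  static Map<String, String> parse(String partitionKey) {
    Map<String, String> values = new LinkedHashMap<>();
    for (String segment : partitionKey.split("/")) {
      int eq = segment.indexOf('=');
      if (eq <= 0) {
        throw new IllegalArgumentException("Not a name=value segment: " + segment);
      }
      values.put(segment.substring(0, eq), segment.substring(eq + 1));
    }
    return values;
  }

  public static void main(String[] args) {
    System.out.println(parse("a=1/b=2"));
  }
}
```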
-
partitionDFByFilter
@Deprecated public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> partitionDFByFilter(org.apache.spark.sql.SparkSession spark, String table, String expression)
Deprecated. Since 1.11.0, will be removed in 1.12.0.
Returns a DataFrame with a row for each partition that matches the specified 'expression'.
- Parameters:
spark - a Spark session
table - name of the table
expression - the expression whose matching partitions are returned
- Returns:
a DataFrame of the table's partitions
-
getPartitions
@Deprecated public static List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, String table)
Deprecated. Since 1.11.0, will be removed in 1.12.0.
Returns all partitions in the table.
- Parameters:
spark - a Spark session
table - a table name and (optional) database
- Returns:
all of the table's partitions
-
getPartitions
public static List<SparkTableUtil.SparkPartition> getPartitions(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, Map<String, String> partitionFilter)
Returns all partitions in the table.
- Parameters:
spark - a Spark session
tableIdent - a table identifier
partitionFilter - partition filter, or null if no filter
- Returns:
all of the table's partitions
-
getPartitionsByFilter
@Deprecated public static List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, String table, String predicate)
Deprecated. Since 1.11.0, will be removed in 1.12.0.
Returns partitions that match the specified 'predicate'.
- Parameters:
spark - a Spark session
table - a table name and (optional) database
predicate - a predicate on partition columns
- Returns:
the table's matching partitions
-
getPartitionsByFilter
@Deprecated public static List<SparkTableUtil.SparkPartition> getPartitionsByFilter(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier tableIdent, org.apache.spark.sql.catalyst.expressions.Expression predicateExpr)
Deprecated. Since 1.11.0, will be removed in 1.12.0.
Returns partitions that match the specified 'predicate'.
- Parameters:
spark - a Spark session
tableIdent - a table identifier
predicateExpr - a predicate expression on partition columns
- Returns:
the table's matching partitions
-
importSparkTable
@Deprecated public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles)
Deprecated. Since 1.11.0, will be removed in 1.12.0.
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is running on the source or target table, and is therefore not thread-safe.
- Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
partitionFilter - only import partitions whose values match those in the map; the map can be partially defined
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
-
importSparkTable
@Deprecated public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, int parallelism)
Deprecated. Since 1.11.0, will be removed in 1.12.0.
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is running on the source or target table, and is therefore not thread-safe.
- Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
parallelism - number of threads to use for file reading
-
importSparkTable
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, ExecutorService service)
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is running on the source or target table, and is therefore not thread-safe.
- Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
service - executor service to use for file reading. If null, file reading is performed on the current thread. If non-null, the provided ExecutorService is shut down within this method after file reading is complete.
-
importSparkTable
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles, int parallelism)
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is running on the source or target table, and is therefore not thread-safe.
- Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
partitionFilter - only import partitions whose values match those in the map; the map can be partially defined
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
parallelism - number of threads to use for file reading
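A partially defined partitionFilter constrains only the columns it names. The sketch below illustrates that matching rule on plain maps. It is a hypothetical illustration, not the library's implementation: PartitionFilterSketch and its filter method are invented for this example, and equality on string values is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of partial partition-filter matching; not part of SparkTableUtil.
final class PartitionFilterSketch {

  // Keeps the partitions whose values equal every entry in partitionFilter.
  // Columns missing from the filter are left unconstrained.
  // A null or empty filter keeps every partition.
  static List<Map<String, String>> filter(
      List<Map<String, String>> partitions, Map<String, String> partitionFilter) {
    if (partitionFilter == null || partitionFilter.isEmpty()) {
      return new ArrayList<>(partitions);
    }
    List<Map<String, String>> matched = new ArrayList<>();
    for (Map<String, String> values : partitions) {
      boolean matches = true;
      for (Map.Entry<String, String> entry : partitionFilter.entrySet()) {
        if (!entry.getValue().equals(values.get(entry.getKey()))) {
          matches = false;
          break;
        }
      }
      if (matches) {
        matched.add(values);
      }
    }
    return matched;
  }

  public static void main(String[] args) {
    List<Map<String, String>> partitions =
        List.of(Map.of("a", "1", "b", "2"), Map.of("a", "1", "b", "3"), Map.of("a", "2", "b", "2"));
    // Only column "a" is constrained, so both partitions with a=1 are kept.
    System.out.println(filter(partitions, Map.of("a", "1")).size());
  }
}
```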
-
importSparkTable
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles, ExecutorService service)
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is running on the source or target table, and is therefore not thread-safe.
- Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
partitionFilter - only import partitions whose values match those in the map; the map can be partially defined
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
service - executor service to use for file reading. If null, file reading is performed on the current thread. If non-null, the provided ExecutorService is shut down within this method after file reading is complete.
-
importSparkTable
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, Map<String, String> partitionFilter, boolean checkDuplicateFiles, boolean ignoreMissingFiles, ExecutorService service)
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is running on the source or target table, and is therefore not thread-safe.
- Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
partitionFilter - only import partitions whose values match those in the map; the map can be partially defined
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
ignoreMissingFiles - if true, ignore FileNotFoundException when running listPartition(org.apache.iceberg.spark.SparkTableUtil.SparkPartition, org.apache.iceberg.PartitionSpec, org.apache.iceberg.hadoop.SerializableConfiguration, org.apache.iceberg.MetricsConfig, org.apache.iceberg.mapping.NameMapping, boolean, java.util.concurrent.ExecutorService) for the Spark partitions
service - executor service to use for file reading. If null, file reading is performed on the current thread. If non-null, the provided ExecutorService is shut down within this method after file reading is complete.
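The service parameter carries an ownership rule: a null executor means the work runs on the calling thread, and a non-null executor is shut down by the method itself. The sketch below models that contract with plain java.util.concurrent types. ExecutorContractSketch and runAll are hypothetical names, and the tasks stand in for file reading; this is not how the library is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model of the documented executor contract; not part of SparkTableUtil.
final class ExecutorContractSketch {

  // Runs the tasks on the calling thread when service is null.
  // Otherwise runs them on the service and shuts the service down afterwards.
  static <T> List<T> runAll(List<Callable<T>> tasks, ExecutorService service) {
    List<T> results = new ArrayList<>();
    try {
      if (service == null) {
        for (Callable<T> task : tasks) {
          results.add(task.call());
        }
      } else {
        for (Future<T> future : service.invokeAll(tasks)) {
          results.add(future.get());
        }
      }
      return results;
    } catch (Exception e) {
      throw new IllegalStateException("Task failed", e);
    } finally {
      if (service != null) {
        service.shutdown();
      }
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    List<Callable<Integer>> tasks = List.of(() -> 1, () -> 2);
    System.out.println(runAll(tasks, pool));
    System.out.println(pool.isShutdown());
  }
}
```

Because the executor is shut down inside the call, a caller following this contract would pass a fresh executor for each import rather than a shared pool.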
-
importSparkTable
@Deprecated public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir, boolean checkDuplicateFiles)
Deprecated. Since 1.11.0, will be removed in 1.12.0.
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is running on the source or target table, and is therefore not thread-safe.
- Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
-
importSparkTable
public static void importSparkTable(org.apache.spark.sql.SparkSession spark, org.apache.spark.sql.catalyst.TableIdentifier sourceTableIdent, Table targetTable, String stagingDir)
Import files from an existing Spark table to an Iceberg table.
The import uses the Spark session to get table metadata. It assumes no other operation is running on the source or target table, and is therefore not thread-safe.
- Parameters:
spark - a Spark session
sourceTableIdent - an identifier of the source Spark table
targetTable - the Iceberg table to import the data into
stagingDir - a staging directory to store temporary manifest files
-
importSparkPartitions
@Deprecated public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles)
Deprecated. Since 1.11.0, will be removed in 1.12.0.
Import files from given partitions to an Iceberg table.
- Parameters:
spark - a Spark session
partitions - partitions to import
targetTable - the Iceberg table to import the data into
spec - a partition spec
stagingDir - a staging directory to store temporary manifest files
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
-
importSparkPartitions
public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles, int parallelism)
Import files from given partitions to an Iceberg table.
- Parameters:
spark - a Spark session
partitions - partitions to import
targetTable - the Iceberg table to import the data into
spec - a partition spec
stagingDir - a staging directory to store temporary manifest files
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
parallelism - number of threads to use for file reading
-
importSparkPartitions
public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles, ExecutorService service)
Import files from given partitions to an Iceberg table.
- Parameters:
spark - a Spark session
partitions - partitions to import
targetTable - the Iceberg table to import the data into
spec - a partition spec
stagingDir - a staging directory to store temporary manifest files
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
service - executor service to use for file reading. If null, file reading is performed on the current thread. If non-null, the provided ExecutorService is shut down within this method after file reading is complete.
-
importSparkPartitions
public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir, boolean checkDuplicateFiles, boolean ignoreMissingFiles, ExecutorService service)
Import files from given partitions to an Iceberg table.
- Parameters:
spark - a Spark session
partitions - partitions to import
targetTable - the Iceberg table to import the data into
spec - a partition spec
stagingDir - a staging directory to store temporary manifest files
checkDuplicateFiles - if true, throw an exception if the import results in a duplicate data file
ignoreMissingFiles - if true, ignore FileNotFoundException when running listPartition(org.apache.iceberg.spark.SparkTableUtil.SparkPartition, org.apache.iceberg.PartitionSpec, org.apache.iceberg.hadoop.SerializableConfiguration, org.apache.iceberg.MetricsConfig, org.apache.iceberg.mapping.NameMapping, boolean, java.util.concurrent.ExecutorService) for the Spark partitions
service - executor service to use for file reading. If null, file reading is performed on the current thread. If non-null, the provided ExecutorService is shut down within this method after file reading is complete.
-
importSparkPartitions
@Deprecated public static void importSparkPartitions(org.apache.spark.sql.SparkSession spark, List<SparkTableUtil.SparkPartition> partitions, Table targetTable, PartitionSpec spec, String stagingDir)
Deprecated. Since 1.11.0, will be removed in 1.12.0.
Import files from given partitions to an Iceberg table.
- Parameters:
spark - a Spark session
partitions - partitions to import
targetTable - the Iceberg table to import the data into
spec - a partition spec
stagingDir - a staging directory to store temporary manifest files
-
filterPartitions
@Deprecated public static List<SparkTableUtil.SparkPartition> filterPartitions(List<SparkTableUtil.SparkPartition> partitions, Map<String, String> partitionFilter)
Deprecated. Since 1.11.0, will be removed in 1.12.0.
-
loadTable
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadTable(org.apache.spark.sql.SparkSession spark, Table table, long snapshotId)
-
loadMetadataTable
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(org.apache.spark.sql.SparkSession spark, Table table, MetadataTableType type)
-
loadMetadataTable
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(org.apache.spark.sql.SparkSession spark, Table table, MetadataTableType type, Map<String, String> extraOptions)
-
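validateWriteBranch
public static void validateWriteBranch(org.apache.spark.sql.SparkSession spark, Table table, String branch, org.apache.spark.sql.util.CaseInsensitiveStringMap options)
-
validateReadBranch
public static void validateReadBranch(org.apache.spark.sql.SparkSession spark, Table table, String branch, org.apache.spark.sql.util.CaseInsensitiveStringMap options)
-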
determineWriteBranch
public static String determineWriteBranch(org.apache.spark.sql.SparkSession spark, SparkTable sparkTable, org.apache.spark.sql.util.CaseInsensitiveStringMap options)
-
determineWriteBranch
public static String determineWriteBranch(org.apache.spark.sql.SparkSession spark, Table table, String branch, org.apache.spark.sql.util.CaseInsensitiveStringMap options)
Determine the write branch.
The target branch can be specified via table identifier, write option, or in SQL:
- The identifier and option branches can't conflict. If both are set, they must match.
- Identifier and option branches take priority over the session WAP branch.
- If neither the option nor the identifier branch is set and WAP is enabled for this table, use the WAP branch from the session SQL config.
Note: WAP ID and WAP branch cannot be set at the same time.
Note: The target branch may be created during the write operation if it does not exist.
- Parameters:
spark - a Spark session
table - the table being written to
branch - write branch configured via table identifier, or null
options - write options
- Returns:
branch for write operation, or null for main branch
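The precedence rules for the write branch can be sketched as a small pure function. This is an illustration under stated assumptions, not the library's code: WriteBranchSketch and resolve are hypothetical, the identifier branch, option branch, WAP flag, and session WAP branch are passed in as plain values instead of being read from the table and Spark session, and the WAP ID conflict check is not modeled.

```java
// Hypothetical model of the documented write-branch precedence; not part of SparkTableUtil.
final class WriteBranchSketch {

  // Returns the branch to write to, or null for the main branch.
  static String resolve(
      String identifierBranch, String optionBranch, boolean wapEnabled, String sessionWapBranch) {
    // Identifier and option branches must agree when both are set.
    if (identifierBranch != null && optionBranch != null && !identifierBranch.equals(optionBranch)) {
      throw new IllegalArgumentException(
          "Conflicting branches: " + identifierBranch + " and " + optionBranch);
    }
    // An explicitly requested branch takes priority over the session WAP branch.
    if (identifierBranch != null) {
      return identifierBranch;
    }
    if (optionBranch != null) {
      return optionBranch;
    }
    // Fall back to the session WAP branch only when WAP is enabled for the table.
    if (wapEnabled && sessionWapBranch != null) {
      return sessionWapBranch;
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(resolve(null, "audit", true, "staging"));
    System.out.println(resolve(null, null, true, "staging"));
  }
}
```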
-
determineReadBranch
public static String determineReadBranch(org.apache.spark.sql.SparkSession spark, SparkTable sparkTable, org.apache.spark.sql.util.CaseInsensitiveStringMap options)
-
determineReadBranch
public static String determineReadBranch(org.apache.spark.sql.SparkSession spark, Table table, String branch, org.apache.spark.sql.util.CaseInsensitiveStringMap options)
Determine the read branch.
The target branch can be specified via table identifier, read option, or in SQL:
- The identifier and option branches can't conflict. If both are set, they must match.
- Identifier and option branches take priority over the session WAP branch.
- If neither the option nor the identifier branch is set and WAP is enabled for this table, use the WAP branch from the session SQL config (only if the branch already exists).
Note: WAP ID and WAP branch cannot be set at the same time.
- Parameters:
spark - a Spark session
table - the table being read from
branch - read branch configured via table identifier, or null
options - read options
- Returns:
branch for read operation, or null for main branch
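For reads, the session WAP branch is used only if it already exists on the table. The sketch below isolates that fallback step. ReadBranchFallbackSketch and resolve are hypothetical names; the explicit branch stands for an identifier or option branch that has already been reconciled, and the set of existing branch names is supplied by the caller as an assumption in place of a table lookup.

```java
import java.util.Set;

// Hypothetical model of the documented read-branch fallback; not part of SparkTableUtil.
final class ReadBranchFallbackSketch {

  // Returns the branch to read from, or null for the main branch.
  static String resolve(
      String explicitBranch,
      boolean wapEnabled,
      String sessionWapBranch,
      Set<String> existingBranches) {
    // A branch from the identifier or read option always wins.
    if (explicitBranch != null) {
      return explicitBranch;
    }
    // The session WAP branch is used only when it already exists on the table.
    if (wapEnabled && sessionWapBranch != null && existingBranches.contains(sessionWapBranch)) {
      return sessionWapBranch;
    }
    return null;
  }

  public static void main(String[] args) {
    // The WAP branch has not been created yet, so the read falls back to main (null).
    System.out.println(resolve(null, true, "staging", Set.of("audit")));
  }
}
```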
-
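migrationService
public static ExecutorService migrationService(int parallelism)
-
findCompatibleSpec
public static PartitionSpec findCompatibleSpec(List<String> partitionNames, Table icebergTable)
Returns the first partition spec in an Iceberg table that shares the same names and ordering as the partition columns provided. Throws an error if no such spec is found.
-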
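The lookup that findCompatibleSpec describes can be sketched on plain lists, with each spec modeled as its ordered list of partition field names. This is a simplified illustration: CompatibleSpecSketch and findCompatible are hypothetical, and the sketch assumes exact, case-sensitive name equality, which may differ from the matching rules the library applies.

```java
import java.util.List;

// Hypothetical model of a "first compatible spec" lookup; not part of SparkTableUtil.
final class CompatibleSpecSketch {

  // Returns the first spec whose field names equal partitionNames in the same order.
  // Throws if no spec matches.
  static List<String> findCompatible(List<String> partitionNames, List<List<String>> specs) {
    for (List<String> specFieldNames : specs) {
      // List.equals compares both the elements and their order.
      if (specFieldNames.equals(partitionNames)) {
        return specFieldNames;
      }
    }
    throw new IllegalArgumentException(
        "No partition spec matches partition columns " + partitionNames);
  }

  public static void main(String[] args) {
    List<List<String>> specs = List.of(List.of("b", "a"), List.of("a", "b"));
    System.out.println(findCompatible(List.of("a", "b"), specs));
  }
}
```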
validatePartitionFilter
public static void validatePartitionFilter(PartitionSpec spec, Map<String, String> partitionFilter, String tableName)
-