java.lang.Object

org.apache.iceberg.spark.actions.RewriteDataFilesSparkAction

All Implemented Interfaces: Action<RewriteDataFiles,RewriteDataFiles.Result>, RewriteDataFiles, SnapshotUpdate<RewriteDataFiles,RewriteDataFiles.Result>

public class RewriteDataFilesSparkAction extends Object implements RewriteDataFiles

Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.iceberg.actions.RewriteDataFiles
RewriteDataFiles.FileGroupFailureResult, RewriteDataFiles.FileGroupInfo, RewriteDataFiles.FileGroupRewriteResult, RewriteDataFiles.Result
Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| protected static final org.apache.iceberg.relocated.com.google.common.base.Joiner | COMMA_JOINER | |
| protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter | COMMA_SPLITTER | |
| protected static final String | FILE_PATH | |
| protected static final String | LAST_MODIFIED | |
| protected static final String | MANIFEST | |
| protected static final String | MANIFEST_LIST | |
| protected static final String | OTHERS | |
| protected static final String | STATISTICS_FILES | |

Fields inherited from interface org.apache.iceberg.actions.RewriteDataFiles
MAX_CONCURRENT_FILE_GROUP_REWRITES, MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT, MAX_FILE_GROUP_SIZE_BYTES, MAX_FILE_GROUP_SIZE_BYTES_DEFAULT, OUTPUT_SPEC_ID, PARTIAL_PROGRESS_ENABLED, PARTIAL_PROGRESS_ENABLED_DEFAULT, PARTIAL_PROGRESS_MAX_COMMITS, PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT, PARTIAL_PROGRESS_MAX_FAILED_COMMITS, REWRITE_JOB_ORDER, REWRITE_JOB_ORDER_DEFAULT, TARGET_FILE_SIZE_BYTES, USE_STARTING_SEQUENCE_NUMBER, USE_STARTING_SEQUENCE_NUMBER_DEFAULT
Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| protected org.apache.spark.sql.Dataset<FileInfo> | allReachableOtherMetadataFileDS(Table table) | |
| RewriteDataFilesSparkAction | binPack() | Choose BINPACK as a strategy for this rewrite operation |
| protected void | commit(SnapshotUpdate<?> update) | |
| protected Map<String,String> | commitSummary() | |
| protected org.apache.spark.sql.Dataset<FileInfo> | contentFileDS(Table table) | |
| protected org.apache.spark.sql.Dataset<FileInfo> | contentFileDS(Table table, Set<Long> snapshotIds) | |
| protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary | deleteFiles(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files) | Deletes files and keeps track of how many files were removed for each file type. |
| protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary | deleteFiles(SupportsBulkOperations io, Iterator<FileInfo> files) | |
| RewriteDataFiles.Result | execute() | Executes this action. |
| RewriteDataFilesSparkAction | filter(Expression expression) | A user provided filter for determining which files will be considered by the rewrite strategy. |
| protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | loadMetadataTable(Table table, MetadataTableType type) | |
| protected org.apache.spark.sql.Dataset<FileInfo> | manifestDS(Table table) | |
| protected org.apache.spark.sql.Dataset<FileInfo> | manifestDS(Table table, Set<Long> snapshotIds) | |
| protected org.apache.spark.sql.Dataset<FileInfo> | manifestListDS(Table table) | |
| protected org.apache.spark.sql.Dataset<FileInfo> | manifestListDS(Table table, Set<Long> snapshotIds) | |
| protected JobGroupInfo | newJobGroupInfo(String groupId, String desc) | |
| protected Table | newStaticTable(TableMetadata metadata, FileIO io) | |
| RewriteDataFilesSparkAction | option(String name, String value) | |
| protected Map<String,String> | options() | |
| RewriteDataFilesSparkAction | options(Map<String,String> newOptions) | |
| protected org.apache.spark.sql.Dataset<FileInfo> | otherMetadataFileDS(Table table) | |
| protected RewriteDataFilesSparkAction | self() | |
| RewriteDataFilesSparkAction | snapshotProperty(String property, String value) | |
| RewriteDataFilesSparkAction | sort() | Choose SORT as a strategy for this rewrite operation using the table's sortOrder |
| RewriteDataFilesSparkAction | sort(SortOrder sortOrder) | Choose SORT as a strategy for this rewrite operation and manually specify the sortOrder to use |
| protected org.apache.spark.sql.SparkSession | spark() | |
| protected org.apache.spark.api.java.JavaSparkContext | sparkContext() | |
| protected org.apache.spark.sql.Dataset<FileInfo> | statisticsFileDS(Table table, Set<Long> snapshotIds) | |
| protected <T> T | withJobGroupInfo(JobGroupInfo info, Supplier<T> supplier) | |
| RewriteDataFilesSparkAction | zOrder(String... columnNames) | Choose Z-ORDER as a strategy for this rewrite operation with a specified list of columns to use |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.iceberg.actions.Action
option, options

Methods inherited from interface org.apache.iceberg.actions.SnapshotUpdate
snapshotProperty

Field Details
- MANIFEST
  
  protected static final String MANIFEST
  See Also:
  
  Constant Field Values
- MANIFEST_LIST
  
  protected static final String MANIFEST_LIST
  See Also:
  
  Constant Field Values
- STATISTICS_FILES
  
  protected static final String STATISTICS_FILES
  See Also:
  
  Constant Field Values
- OTHERS
  
  protected static final String OTHERS
  See Also:
  
  Constant Field Values
- FILE_PATH
  
  protected static final String FILE_PATH
  See Also:
  
  Constant Field Values
- LAST_MODIFIED
  
  protected static final String LAST_MODIFIED
  See Also:
  
  Constant Field Values
- COMMA_SPLITTER
  
  protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER
- COMMA_JOINER
  
  protected static final org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER
Method Details
- self
  
  protected RewriteDataFilesSparkAction self()
- binPack
  
  public RewriteDataFilesSparkAction binPack()
  
  Description copied from interface: RewriteDataFiles
  
  Choose BINPACK as a strategy for this rewrite operation
  
  Specified by:
  
  binPack in interface RewriteDataFiles
  
  Returns:
  
  this for method chaining
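
As a rough illustration of what a bin-packing strategy does, the sketch below groups file sizes so that each group stays at or below a size limit. It is a plain first-fit packer written for this page, not Iceberg's planner; the class name `BinPackSketch`, the method `pack`, and the sample sizes are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a first-fit packer, not the logic Iceberg uses internally.
final class BinPackSketch {
  // Packs file sizes (in bytes) into groups whose total stays at or below maxGroupBytes.
  // A file larger than the limit ends up in a group of its own.
  static List<List<Long>> pack(List<Long> fileSizes, long maxGroupBytes) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> totals = new ArrayList<>();
    for (long size : fileSizes) {
      int target = -1;
      // Find the first existing group that still has room for this file.
      for (int i = 0; i < groups.size(); i++) {
        if (totals.get(i) + size <= maxGroupBytes) {
          target = i;
          break;
        }
      }
      // No group has room: open a new one.
      if (target < 0) {
        groups.add(new ArrayList<>());
        totals.add(0L);
        target = groups.size() - 1;
      }
      groups.get(target).add(size);
      totals.set(target, totals.get(target) + size);
    }
    return groups;
  }

  public static void main(String[] args) {
    // prints [[40, 30, 10], [70], [60]]
    System.out.println(pack(List.of(40L, 70L, 30L, 60L, 10L), 100L));
  }
}
```

In the real action, the group size limit is presumably what an option such as MAX_FILE_GROUP_SIZE_BYTES controls; the exact grouping rules are not described on this page.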
- sort
  
  public RewriteDataFilesSparkAction sort(SortOrder sortOrder)
  
  Description copied from interface: RewriteDataFiles
  
  Choose SORT as a strategy for this rewrite operation and manually specify the sortOrder to use
  
  Specified by:
  
  sort in interface RewriteDataFiles
  
  Parameters:
  
  sortOrder - user defined sortOrder
  
  Returns:
  
  this for method chaining
- sort
  
  public RewriteDataFilesSparkAction sort()
  
  Description copied from interface: RewriteDataFiles
  
  Choose SORT as a strategy for this rewrite operation using the table's sortOrder
  
  Specified by:
  
  sort in interface RewriteDataFiles
  
  Returns:
  
  this for method chaining
- zOrder
  
  public RewriteDataFilesSparkAction zOrder(String... columnNames)
  
  Description copied from interface: RewriteDataFiles
  
  Choose Z-ORDER as a strategy for this rewrite operation with a specified list of columns to use
  
  Specified by:
  
  zOrder in interface RewriteDataFiles
  
  Parameters:
  
  columnNames - Columns to be used to generate Z-Values
  
  Returns:
  
  this for method chaining
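
The idea behind a Z-value can be sketched by interleaving the bits of two column values, so that rows close in both columns tend to get nearby Z-values. The example below assumes two non-negative `int` columns and is only the textbook construction; how this action encodes different column types into Z-values is not described here, and `ZOrderSketch` and `interleave` are hypothetical names.

```java
// Illustrative only: textbook bit interleaving for two int values.
final class ZOrderSketch {
  // Bit i of x goes to bit 2i of the result, bit i of y goes to bit 2i + 1.
  static long interleave(int x, int y) {
    long z = 0L;
    for (int i = 0; i < 32; i++) {
      z |= ((long) (x >>> i) & 1L) << (2 * i);
      z |= ((long) (y >>> i) & 1L) << (2 * i + 1);
    }
    return z;
  }

  public static void main(String[] args) {
    // x = 3 (binary 011), y = 5 (binary 101) interleave to binary 100111.
    System.out.println(interleave(3, 5)); // prints 39
  }
}
```

Sorting rows by such a value clusters data along several columns at once, which is the reason to prefer it over a single-column sort when queries filter on more than one column.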
- filter
  
  public RewriteDataFilesSparkAction filter(Expression expression)
  
  Description copied from interface: RewriteDataFiles
  
  A user provided filter for determining which files will be considered by the rewrite strategy. This will be used in addition to whatever rules the rewrite strategy generates. For example this would be used for providing a restriction to only run rewrite on a specific partition.
  
  Specified by:
  
  filter in interface RewriteDataFiles
  
  Parameters:
  
  expression - An iceberg expression used to determine which files will be considered for rewriting
  
  Returns:
  
  this for method chaining
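
The phrase "in addition to" above reads as a logical AND: a file is considered only if it passes both the user filter and the strategy's own rule. The sketch below models that with plain `Predicate` values. `FileStub`, `candidates`, and the "small file" rule are hypothetical stand-ins for illustration, not Iceberg types or behavior.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative only: models "user filter AND strategy rule" with plain predicates.
final class FilterSketch {
  // Hypothetical stand-in for a data file: its partition value and size in bytes.
  static final class FileStub {
    final String partition;
    final long sizeBytes;

    FileStub(String partition, long sizeBytes) {
      this.partition = partition;
      this.sizeBytes = sizeBytes;
    }
  }

  // Returns the files that pass both the user filter and the strategy's own rule.
  static List<FileStub> candidates(
      List<FileStub> files, Predicate<FileStub> userFilter, Predicate<FileStub> strategyRule) {
    return files.stream().filter(userFilter.and(strategyRule)).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<FileStub> files = List.of(
        new FileStub("2024-01-01", 10L),
        new FileStub("2024-01-01", 900L),
        new FileStub("2024-01-02", 10L));
    // User restriction: only one partition is eligible.
    Predicate<FileStub> userFilter = f -> f.partition.equals("2024-01-01");
    // Assumed strategy rule for this sketch: only small files are worth rewriting.
    Predicate<FileStub> strategyRule = f -> f.sizeBytes < 100L;
    System.out.println(candidates(files, userFilter, strategyRule).size()); // prints 1
  }
}
```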
- execute
  
  public RewriteDataFiles.Result execute()
  
  Description copied from interface: Action
  
  Executes this action.
  
  Specified by:
  
  execute in interface Action<RewriteDataFiles,RewriteDataFiles.Result>
  
  Returns:
  
  the result of this action
- snapshotProperty
  
  public RewriteDataFilesSparkAction snapshotProperty(String property, String value)
- commit
  
  protected void commit(SnapshotUpdate<?> update)
- commitSummary
  
  protected Map<String,String> commitSummary()
- spark
  
  protected org.apache.spark.sql.SparkSession spark()
- sparkContext
  
  protected org.apache.spark.api.java.JavaSparkContext sparkContext()
- option
  
  public RewriteDataFilesSparkAction option(String name, String value)
- options
  
  public RewriteDataFilesSparkAction options(Map<String,String> newOptions)
- options
  
  protected Map<String,String> options()
- withJobGroupInfo
  
  protected <T> T withJobGroupInfo(JobGroupInfo info, Supplier<T> supplier)
- newJobGroupInfo
  
  protected JobGroupInfo newJobGroupInfo(String groupId, String desc)
- newStaticTable
  
  protected Table newStaticTable(TableMetadata metadata, FileIO io)
- contentFileDS
  
  protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table)
- contentFileDS
  
  protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table, Set<Long> snapshotIds)
- manifestDS
  
  protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table)
- manifestDS
  
  protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table, Set<Long> snapshotIds)
- manifestListDS
  
  protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table)
- manifestListDS
  
  protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table, Set<Long> snapshotIds)
- statisticsFileDS
  
  protected org.apache.spark.sql.Dataset<FileInfo> statisticsFileDS(Table table, Set<Long> snapshotIds)
- otherMetadataFileDS
  
  protected org.apache.spark.sql.Dataset<FileInfo> otherMetadataFileDS(Table table)
- allReachableOtherMetadataFileDS
  
  protected org.apache.spark.sql.Dataset<FileInfo> allReachableOtherMetadataFileDS(Table table)
- loadMetadataTable
  
  protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)
- deleteFiles
  
  protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files)
  
  Deletes files and keeps track of how many files were removed for each file type.
  
  Parameters:
  
  executorService - an executor service to use for parallel deletes
  
  deleteFunc - a function that deletes the file at a given path
  
  files - an iterator of Spark rows of the structure (path: String, type: String)
  
  Returns:
  
  stats on which files were deleted
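
A minimal sketch of the pattern this method describes: submit one delete task per file to an executor, call the delete function with each path, and tally successful deletes per file type. `FileEntry` is a hypothetical stand-in for FileInfo, the returned map stands in for DeleteSummary, and skipping failed deletes without counting them is an assumption of the sketch rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Illustrative only: parallel deletes with a per-type counter.
final class ParallelDeleteSketch {
  // Hypothetical stand-in for FileInfo: a path plus a file type label.
  static final class FileEntry {
    final String path;
    final String type;

    FileEntry(String path, String type) {
      this.path = path;
      this.type = type;
    }
  }

  // Submits one delete task per file and counts successful deletes per type.
  static Map<String, Long> deleteAll(
      ExecutorService pool, Consumer<String> deleteFunc, Iterator<FileEntry> files) {
    Map<String, AtomicLong> counts = new ConcurrentHashMap<>();
    List<Future<?>> pending = new ArrayList<>();
    while (files.hasNext()) {
      FileEntry file = files.next();
      pending.add(pool.submit(() -> {
        try {
          deleteFunc.accept(file.path);
          counts.computeIfAbsent(file.type, t -> new AtomicLong()).incrementAndGet();
        } catch (RuntimeException e) {
          // Assumption of this sketch: a failed delete is skipped, not counted.
        }
      }));
    }
    // Wait for every task before reporting the totals.
    for (Future<?> task : pending) {
      try {
        task.get();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
      } catch (ExecutionException e) {
        throw new RuntimeException(e.getCause());
      }
    }
    Map<String, Long> summary = new TreeMap<>();
    counts.forEach((type, n) -> summary.put(type, n.get()));
    return summary;
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<FileEntry> files = List.of(
          new FileEntry("a.parquet", "data"),
          new FileEntry("b.parquet", "data"),
          new FileEntry("m.avro", "manifest"));
      // The no-op consumer stands in for a real delete call.
      System.out.println(deleteAll(pool, path -> { }, files.iterator()));
    } finally {
      pool.shutdown();
    }
  }
}
```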
- deleteFiles
  
  protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, Iterator<FileInfo> files)
