Interface RewriteDataFiles

All Superinterfaces:
Action<RewriteDataFiles,RewriteDataFiles.Result>, SnapshotUpdate<RewriteDataFiles,RewriteDataFiles.Result>
All Known Implementing Classes:
RewriteDataFilesSparkAction

public interface RewriteDataFiles extends SnapshotUpdate<RewriteDataFiles,RewriteDataFiles.Result>
An action for rewriting data files according to a rewrite strategy. Generally used for optimizing the sizing and layout of data files within a table.
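
As a minimal usage sketch, the action is typically obtained from an engine integration rather than instantiated directly; the example below assumes the Spark integration (SparkActions) is on the classpath and a Table handle has already been loaded.

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.RewriteDataFiles;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    public class CompactTable {
      // Compacts a table with the bin-pack strategy and reports how many
      // data files were rewritten and how many files were added.
      static void compact(SparkSession spark, Table table) {
        RewriteDataFiles.Result result =
            SparkActions.get(spark)
                .rewriteDataFiles(table)
                .binPack()   // bin-pack is the default strategy; shown here for clarity
                .execute();

        System.out.printf(
            "Rewrote %d data files into %d data files%n",
            result.rewrittenDataFilesCount(), result.addedDataFilesCount());
      }
    }
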
  • Field Details

    • PARTIAL_PROGRESS_ENABLED

      static final String PARTIAL_PROGRESS_ENABLED
      Enable committing groups of files (see max-file-group-size-bytes) prior to the entire rewrite completing. This will produce additional commits but allow for progress even if some groups fail to commit. This setting will not change the correctness of the rewrite operation as file groups can be compacted independently.

      The default is false, which produces a single commit when the entire job has completed.

    • PARTIAL_PROGRESS_ENABLED_DEFAULT

      static final boolean PARTIAL_PROGRESS_ENABLED_DEFAULT
    • PARTIAL_PROGRESS_MAX_COMMITS

      static final String PARTIAL_PROGRESS_MAX_COMMITS
      The maximum number of Iceberg commits that this rewrite is allowed to produce if partial progress is enabled. This setting has no effect if partial progress is disabled.
    • PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT

      static final int PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT
    • PARTIAL_PROGRESS_MAX_FAILED_COMMITS

      static final String PARTIAL_PROGRESS_MAX_FAILED_COMMITS
      The maximum number of failed commits allowed during this rewrite if partial progress is enabled. By default, all commits are allowed to fail. This setting has no effect if partial progress is disabled.
    • MAX_FILE_GROUP_SIZE_BYTES

      static final String MAX_FILE_GROUP_SIZE_BYTES
      The entire rewrite operation is broken down into pieces based on partitioning and, within partitions, into groups based on size. These sub-units of the rewrite are referred to as file groups. The largest amount of data that should be compacted in a single group is controlled by MAX_FILE_GROUP_SIZE_BYTES. This helps break down the rewriting of very large partitions which may not otherwise be rewritable due to the resource constraints of the cluster. For example, a sort-based rewrite may not scale to terabyte-sized partitions; those partitions need to be worked on in small subsections to avoid exhausting resources.

      When grouping files, the underlying rewrite strategy will use this value to limit the files included in a single file group. A group will be processed by a single framework "action"; in Spark, for example, each group is rewritten in its own Spark action. A group will never contain files for multiple output partitions. See the configuration sketch after this list for an example of setting this option.

    • MAX_FILE_GROUP_SIZE_BYTES_DEFAULT

      static final long MAX_FILE_GROUP_SIZE_BYTES_DEFAULT
    • MAX_CONCURRENT_FILE_GROUP_REWRITES

      static final String MAX_CONCURRENT_FILE_GROUP_REWRITES
      The maximum number of file groups to be simultaneously rewritten by the rewrite strategy. The structure and contents of each group are determined by the rewrite strategy. Each file group will be rewritten independently and asynchronously.
    • MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT

      static final int MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT
    • TARGET_FILE_SIZE_BYTES

      static final String TARGET_FILE_SIZE_BYTES
      The output file size that this rewrite strategy will attempt to generate when rewriting files. By default, this will use the "write.target-file-size-bytes" value in the table properties of the table being updated.
    • USE_STARTING_SEQUENCE_NUMBER

      static final String USE_STARTING_SEQUENCE_NUMBER
      If true, the compaction will use the sequence number of the snapshot at compaction start time for new data files, instead of the sequence number of the newly produced snapshot.

      This avoids commit conflicts with updates that add newer equality deletes at a higher sequence number.

      Defaults to true.

    • USE_STARTING_SEQUENCE_NUMBER_DEFAULT

      static final boolean USE_STARTING_SEQUENCE_NUMBER_DEFAULT
    • REWRITE_JOB_ORDER

      static final String REWRITE_JOB_ORDER
      Forces the rewrite job order based on the value.

      • If rewrite-job-order=bytes-asc, then rewrite the smallest job groups first.
      • If rewrite-job-order=bytes-desc, then rewrite the largest job groups first.
      • If rewrite-job-order=files-asc, then rewrite the job groups with the least files first.
      • If rewrite-job-order=files-desc, then rewrite the job groups with the most files first.
      • If rewrite-job-order=none, then rewrite job groups in the order they were planned (no specific ordering).

      Defaults to none.

    • REWRITE_JOB_ORDER_DEFAULT

      static final String REWRITE_JOB_ORDER_DEFAULT
    • OUTPUT_SPEC_ID

      static final String OUTPUT_SPEC_ID
      The partition specification ID to be used for rewritten files.

      The output-spec-id is used by the file rewriter during the rewrite operation to identify the specific output partition spec. Data will be reorganized during the rewrite to align with the output partitioning. Defaults to the current table specification.

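
As a sketch of how several of these options can be combined through the generic option(String, String) call inherited from Action, assuming the Spark integration; the numeric values are illustrative, not defaults.

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.RewriteDataFiles;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    public class CompactLargePartitions {
      // Enables partial progress and caps the file group size so that very
      // large partitions are compacted in smaller, independently committed
      // pieces. All numeric values below are illustrative.
      static RewriteDataFiles.Result compact(SparkSession spark, Table table) {
        return SparkActions.get(spark)
            .rewriteDataFiles(table)
            .option(RewriteDataFiles.PARTIAL_PROGRESS_ENABLED, "true")
            .option(RewriteDataFiles.PARTIAL_PROGRESS_MAX_COMMITS, "10")
            // 10 GiB per file group (illustrative, not the default)
            .option(RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
                String.valueOf(10L * 1024 * 1024 * 1024))
            // 512 MiB target output files (illustrative, not the default)
            .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
                String.valueOf(512L * 1024 * 1024))
            .execute();
      }
    }
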
  • Method Details

    • binPack

      default RewriteDataFiles binPack()
      Choose BINPACK as a strategy for this rewrite operation
      Returns:
      this for method chaining
    • sort

      default RewriteDataFiles sort()
      Choose SORT as a strategy for this rewrite operation using the table's sortOrder
      Returns:
      this for method chaining
    • sort

      default RewriteDataFiles sort(SortOrder sortOrder)
      Choose SORT as a strategy for this rewrite operation and manually specify the sortOrder to use
      Parameters:
      sortOrder - user defined sortOrder
      Returns:
      this for method chaining
    • zOrder

      default RewriteDataFiles zOrder(String... columns)
      Choose Z-ORDER as a strategy for this rewrite operation with a specified list of columns to use
      Parameters:
      columns - Columns to be used to generate Z-Values
      Returns:
      this for method chaining
    • filter

      RewriteDataFiles filter(Expression expression)
      A user-provided filter for determining which files will be considered by the rewrite strategy. This is used in addition to whatever rules the rewrite strategy generates; for example, it can be used to restrict the rewrite to a specific partition, as shown in the sketch below.
      Parameters:
      expression - An Iceberg expression used to determine which files will be considered for rewriting
      Returns:
      this for chaining
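
As a sketch of combining filter(Expression) with a strategy, assuming the Spark integration and a table with hypothetical columns event_date, device_id, and event_ts (the column names and filter value are illustrative).

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.RewriteDataFiles;
    import org.apache.iceberg.expressions.Expressions;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    public class RewriteOnePartition {
      // Considers only files matching the filter and z-orders the rewritten
      // output by two columns.
      static RewriteDataFiles.Result rewrite(SparkSession spark, Table table) {
        return SparkActions.get(spark)
            .rewriteDataFiles(table)
            .filter(Expressions.equal("event_date", "2023-01-01"))  // hypothetical partition column
            .zOrder("device_id", "event_ts")                        // hypothetical columns
            .execute();
      }
    }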