Package org.apache.iceberg.actions
Interface FileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>
- Type Parameters:
T - the Java type of tasks to read content files
F - the Java type of content files
- All Known Implementing Classes:
SizeBasedDataRewriter, SizeBasedFileRewriter, SizeBasedPositionDeletesRewriter
public interface FileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>
A class for rewriting content files.
The entire rewrite operation is broken down into pieces based on partitioning, and size-based groups within a partition. These subunits of the rewrite are referred to as file groups. A file group will be processed by a single framework "action". For example, in Spark this means that each group would be rewritten in its own Spark job.
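The first level of this breakdown, bucketing by partition, can be illustrated with a minimal sketch. It does not use the Iceberg API: `PartitionBucketsSketch`, `Task`, and `byPartition` are hypothetical names, and a plain string stands in for a real partition value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionBucketsSketch {
  /** Hypothetical stand-in for a scan task: only the fields this sketch needs. */
  static final class Task {
    final String path;
    final String partition;

    Task(String path, String partition) {
      this.path = path;
      this.partition = partition;
    }
  }

  /** Buckets tasks by partition, keeping partitions in first-seen order. */
  static Map<String, List<Task>> byPartition(Iterable<Task> tasks) {
    Map<String, List<Task>> buckets = new LinkedHashMap<>();
    for (Task task : tasks) {
      buckets.computeIfAbsent(task.partition, key -> new ArrayList<>()).add(task);
    }
    return buckets;
  }
}
```

Each bucket would then be split further into file groups, and each file group handed to the engine as one unit of work.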
-
Method Summary
Modifier and Type | Method | Description
default String | description() | Returns a description for this rewriter.
void | init(Map<String,String> options) | Initializes this rewriter using provided options.
Iterable<List<T>> | planFileGroups(Iterable<T> tasks) | Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups.
Set<F> | rewrite(List<T> group) | Rewrites a group of files represented by the given list of scan tasks.
Set<String> | validOptions() | Returns a set of supported options for this rewriter.
-
Method Details
-
description
Returns a description for this rewriter.
validOptions
Returns a set of supported options for this rewriter. Only options specified in this set will be accepted at runtime. Any other options will be rejected.
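A caller-side check against this set might look like the sketch below. It is only an illustration of the accept/reject rule: `OptionValidationSketch` and `validate` are hypothetical names, and throwing `IllegalArgumentException` is an assumption of this sketch, not documented behavior.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class OptionValidationSketch {
  /**
   * Rejects any option key that is not in validOptions.
   * The exception type is this sketch's choice, not documented behavior.
   */
  static void validate(Map<String, String> options, Set<String> validOptions) {
    // Sorted copy of the supplied keys, so the error message is deterministic.
    Set<String> unknown = new TreeSet<>(options.keySet());
    unknown.removeAll(validOptions);
    if (!unknown.isEmpty()) {
      throw new IllegalArgumentException(
          "Unsupported options: " + unknown + "; supported options: " + new TreeSet<>(validOptions));
    }
  }
}
```

Running such a check before `init` would let a misspelled option name fail fast instead of being silently ignored.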
init
Initializes this rewriter using provided options.
- Parameters:
options - options to initialize this rewriter
-
planFileGroups
Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups. Each file group is then rewritten as a single executable unit, such as a Spark job.
- Parameters:
tasks - an iterable of scan tasks for files in a partition
- Returns:
groups of scan tasks for files to be rewritten in a single executable unit
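One plausible size-based strategy for this method is sketched below. It is not the logic of any Iceberg implementation: `PlanFileGroupsSketch`, the `sizeOf` function, and the `minFileBytes` and `maxGroupBytes` thresholds are all hypothetical. The sketch selects files below a size threshold, packs them in input order into groups under a byte limit, and drops single-file groups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

final class PlanFileGroupsSketch {
  /**
   * Selects tasks smaller than minFileBytes and packs them, in input order, into groups whose
   * total size stays at or below maxGroupBytes. Single-file groups are dropped because
   * rewriting one file alone would not reduce the file count.
   */
  static <T> Iterable<List<T>> planFileGroups(
      Iterable<T> tasks, ToLongFunction<T> sizeOf, long minFileBytes, long maxGroupBytes) {
    List<List<T>> groups = new ArrayList<>();
    List<T> current = new ArrayList<>();
    long currentBytes = 0L;
    for (T task : tasks) {
      long size = sizeOf.applyAsLong(task);
      if (size >= minFileBytes) {
        continue; // already large enough; not a rewrite target in this sketch
      }
      if (!current.isEmpty() && currentBytes + size > maxGroupBytes) {
        // Adding this task would exceed the limit, so close the current group.
        groups.add(current);
        current = new ArrayList<>();
        currentBytes = 0L;
      }
      current.add(task);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    groups.removeIf(group -> group.size() < 2);
    return groups;
  }
}
```

Packing in input order keeps the sketch short; a real planner could sort or bin-pack to fill groups more evenly.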
-
rewrite
Rewrites a group of files represented by the given list of scan tasks.
The implementation is expected to be engine-specific (e.g. Spark, Flink, Trino).
- Parameters:
group - a group of scan tasks for files to be rewritten together
- Returns:
a set of newly written files
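How a caller might drive these methods together can be sketched as follows. `MiniRewriter` is a hypothetical, simplified mirror of the methods documented on this page, not the Iceberg interface, and the sequential loop is an assumption: an engine integration may run each group as its own job, possibly concurrently.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RewriteLifecycleSketch {
  /** Hypothetical, simplified mirror of the documented methods; not the Iceberg interface. */
  interface MiniRewriter<T, F> {
    void init(Map<String, String> options);

    Iterable<List<T>> planFileGroups(Iterable<T> tasks);

    Set<F> rewrite(List<T> group);
  }

  /** Initializes the rewriter, plans groups, rewrites each group, and collects the new files. */
  static <T, F> Set<F> run(
      MiniRewriter<T, F> rewriter, Map<String, String> options, Iterable<T> tasks) {
    rewriter.init(options);
    Set<F> added = new LinkedHashSet<>();
    for (List<T> group : rewriter.planFileGroups(tasks)) {
      // In an engine integration, each call here would typically be its own job.
      added.addAll(rewriter.rewrite(group));
    }
    return added;
  }
}
```

Keeping the driver independent of the engine means only `rewrite` needs an engine-specific implementation in this sketch.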
-