Interface FileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>

Type Parameters:
T - the Java type of tasks to read content files
F - the Java type of content files
All Known Implementing Classes:
SizeBasedDataRewriter, SizeBasedFileRewriter, SizeBasedPositionDeletesRewriter

public interface FileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>
An interface for rewriting content files.

The rewrite operation is split first by partition and then into size-based groups within each partition. These subunits of the rewrite are referred to as file groups. Each file group is processed by a single framework "action". For example, in Spark each group would be rewritten in its own Spark job.
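The call order implied by the methods of this interface (initialize, plan file groups, rewrite each group) can be sketched as follows. This is a minimal illustration only: SimpleRewriter, BatchingRewriter, rewritePartition and the "group-size" option are hypothetical stand-ins that are not part of the library, tasks and files are plain strings, and groups are processed sequentially rather than as separate engine jobs.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RewriteLifecycleSketch {

  // Hypothetical, simplified stand-in for FileRewriter: tasks and files are plain Strings.
  interface SimpleRewriter {
    Set<String> validOptions();

    void init(Map<String, String> options);

    Iterable<List<String>> planFileGroups(Iterable<String> tasks);

    Set<String> rewrite(List<String> group);
  }

  // Toy implementation: batches tasks into fixed-size groups and "rewrites" each group into one file.
  static class BatchingRewriter implements SimpleRewriter {
    static final String GROUP_SIZE = "group-size"; // hypothetical option name
    private int groupSize = 2;

    @Override
    public Set<String> validOptions() {
      return Set.of(GROUP_SIZE);
    }

    @Override
    public void init(Map<String, String> options) {
      groupSize = Integer.parseInt(options.getOrDefault(GROUP_SIZE, "2"));
    }

    @Override
    public Iterable<List<String>> planFileGroups(Iterable<String> tasks) {
      List<List<String>> groups = new ArrayList<>();
      List<String> current = new ArrayList<>();
      for (String task : tasks) {
        current.add(task);
        if (current.size() == groupSize) {
          groups.add(current);
          current = new ArrayList<>();
        }
      }
      if (!current.isEmpty()) {
        groups.add(current);
      }
      return groups;
    }

    @Override
    public Set<String> rewrite(List<String> group) {
      return Set.of("rewritten(" + String.join("+", group) + ")");
    }
  }

  // Drives the tasks of one partition through the rewriter: init, plan, then rewrite group by group.
  static Set<String> rewritePartition(
      SimpleRewriter rewriter, Map<String, String> options, List<String> tasks) {
    rewriter.init(options);
    Set<String> newFiles = new LinkedHashSet<>();
    for (List<String> group : rewriter.planFileGroups(tasks)) {
      // A real action would hand each group to its own executable unit, e.g. a Spark job.
      newFiles.addAll(rewriter.rewrite(group));
    }
    return newFiles;
  }

  public static void main(String[] args) {
    Set<String> files =
        rewritePartition(new BatchingRewriter(), Map.of("group-size", "2"), List.of("a", "b", "c"));
    System.out.println(files); // prints [rewritten(a+b), rewritten(c)]
  }
}
```

The driver is kept separate from the rewriter on purpose: planning decides what belongs together, while the caller decides how and where each group is executed.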

  • Method Summary

    Modifier and Type
    Method
    Description
    default String
    description()
    Returns a description for this rewriter.
    void
    init(Map<String,String> options)
    Initializes this rewriter using provided options.
    Iterable<List<T>>
    planFileGroups(Iterable<T> tasks)
    Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups.
    Set<F>
    rewrite(List<T> group)
    Rewrites a group of files represented by the given list of scan tasks.
    Set<String>
    validOptions()
    Returns a set of supported options for this rewriter.
  • Method Details

    • description

      default String description()
      Returns a description for this rewriter.
    • validOptions

      Set<String> validOptions()
      Returns a set of supported options for this rewriter. Only options contained in this set will be accepted at runtime. Any other options will be rejected.
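The accept-or-reject rule described for validOptions() could be checked by a caller along these lines. This is a sketch under assumptions: the helper names invalidKeys and validate are hypothetical, the option keys in the usage note are illustrative, and the source does not say which component performs the check or which exception is raised.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class OptionValidationSketch {

  // Returns the supplied option keys that are not in the supported set (sorted for stable messages).
  static Set<String> invalidKeys(Set<String> validOptions, Map<String, String> options) {
    Set<String> invalid = new TreeSet<>(options.keySet());
    invalid.removeAll(validOptions);
    return invalid;
  }

  // Rejects the whole option map if it contains any unsupported key.
  // Assumption: IllegalArgumentException is used here only for illustration.
  static void validate(Set<String> validOptions, Map<String, String> options) {
    Set<String> invalid = invalidKeys(validOptions, options);
    if (!invalid.isEmpty()) {
      throw new IllegalArgumentException(
          "Unsupported options: " + invalid + ", supported: " + new TreeSet<>(validOptions));
    }
  }
}
```

For example, validate(Set.of("option-a", "option-b"), Map.of("option-c", "1")) throws because "option-c" is not in the supported set, while a map that uses only "option-a" passes.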
    • init

      void init(Map<String,String> options)
      Initializes this rewriter using provided options.
      Parameters:
      options - options to initialize this rewriter
    • planFileGroups

      Iterable<List<T>> planFileGroups(Iterable<T> tasks)
      Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups. Each file group is then rewritten in a single executable unit, such as a Spark job.
      Parameters:
      tasks - an iterable of scan tasks for files in a partition
      Returns:
      groups of scan tasks for files to be rewritten in a single executable unit
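To make the select-then-group idea concrete, here is a simplified sketch of size-based planning. It is not the algorithm used by the known implementing classes: tasks are reduced to file sizes in bytes, and the thresholds minSize, maxSize and maxGroupSize are hypothetical parameters chosen for illustration. Files outside the [minSize, maxSize] range are treated as rewrite candidates and packed greedily, in input order, into groups whose total size stays within maxGroupSize where possible.

```java
import java.util.ArrayList;
import java.util.List;

public class SizeBasedPlanningSketch {

  // Selects files that are too small or too large and packs them greedily into groups.
  // A single file larger than maxGroupSize still gets a group of its own.
  static List<List<Long>> planFileGroups(
      List<Long> fileSizes, long minSize, long maxSize, long maxGroupSize) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0;

    for (long size : fileSizes) {
      boolean needsRewrite = size < minSize || size > maxSize;
      if (!needsRewrite) {
        continue; // already within the desired size range, leave the file untouched
      }
      // Close the current group if adding this file would exceed the group size limit.
      if (!current.isEmpty() && currentSize + size > maxGroupSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentSize = 0;
      }
      current.add(size);
      currentSize += size;
    }

    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<List<Long>> groups = planFileGroups(List.of(10L, 100L, 20L, 500L, 30L), 50, 200, 40);
    System.out.println(groups); // prints [[10, 20], [500], [30]]
  }
}
```

In the usage above, the 100-byte file is skipped because it already falls inside the target range, and the 500-byte file ends up alone because no other file fits next to it under the group limit.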
    • rewrite

      Set<F> rewrite(List<T> group)
      Rewrites a group of files represented by the given list of scan tasks.

      Implementations are expected to be engine-specific (e.g. Spark, Flink, Trino).

      Parameters:
      group - a group of scan tasks for files to be rewritten together
      Returns:
      a set of newly written files