Interface FileRewriter<T extends ContentScanTask<F>, F extends ContentFile<F>>

  • Type Parameters:
    T - the Java type of tasks to read content files
    F - the Java type of content files
    All Known Implementing Classes:
    SizeBasedDataRewriter, SizeBasedFileRewriter, SizeBasedPositionDeletesRewriter

    public interface FileRewriter<T extends ContentScanTask<F>, F extends ContentFile<F>>
    An interface for rewriting content files.

    The entire rewrite operation is broken down into pieces based on partitioning, and size-based groups within a partition. These subunits of the rewrite are referred to as file groups. A file group will be processed by a single framework "action". For example, in Spark this means that each group would be rewritten in its own Spark job.

    • Method Summary

      Modifier and Type Method Description
      default java.lang.String description()
      Returns a description for this rewriter.
      void init(java.util.Map<java.lang.String, java.lang.String> options)
      Initializes this rewriter using provided options.
      java.lang.Iterable<java.util.List<T>> planFileGroups(java.lang.Iterable<T> tasks)
      Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups.
      java.util.Set<F> rewrite(java.util.List<T> group)
      Rewrites a group of files represented by the given list of scan tasks.
      java.util.Set<java.lang.String> validOptions()
      Returns a set of supported options for this rewriter.
    • Method Detail

      • description

        default java.lang.String description()
        Returns a description for this rewriter.
      • validOptions

        java.util.Set<java.lang.String> validOptions()
        Returns a set of supported options for this rewriter. Only options specified in this set will be accepted at runtime. Any other options will be rejected.
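
        The accept-or-reject rule above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Iceberg's own implementation: the class OptionValidationSketch and both of its methods are hypothetical names, and the check is assumed to run on the caller's options map before init is invoked.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the documented API.
final class OptionValidationSketch {
  private OptionValidationSketch() {}

  // Returns the supplied option keys that are missing from the supported set,
  // sorted so that error messages are stable.
  static Set<String> unsupportedKeys(Map<String, String> options, Set<String> validOptions) {
    Set<String> unsupported = new TreeSet<>(options.keySet());
    unsupported.removeAll(validOptions);
    return unsupported;
  }

  // Throws if any supplied option is not supported; otherwise does nothing.
  static void checkOptions(Map<String, String> options, Set<String> validOptions) {
    Set<String> unsupported = unsupportedKeys(options, validOptions);
    if (!unsupported.isEmpty()) {
      throw new IllegalArgumentException("Unsupported options: " + unsupported);
    }
  }
}
```

        A caller could pass the result of validOptions() as the second argument and call init only if checkOptions returns normally.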
      • init

        void init(java.util.Map<java.lang.String, java.lang.String> options)
        Initializes this rewriter using provided options.
        Parameters:
        options - options to initialize this rewriter
      • planFileGroups

        java.lang.Iterable<java.util.List<T>> planFileGroups(java.lang.Iterable<T> tasks)
        Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups. Each file group is then rewritten as a single executable unit, such as a Spark job.
        Parameters:
        tasks - an iterable of scan tasks for files in a partition
        Returns:
        groups of scan tasks for files to be rewritten in a single executable unit
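
        The select-then-group behavior described above might look like the following greedy sketch. It is an assumption-laden illustration, not the algorithm used by any implementing class: PlanFileGroupsSketch, the shouldRewrite predicate, the sizeOf function, and the maxGroupSize limit are all hypothetical, and tasks are modeled as an unbounded type T so the example compiles without Iceberg on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.ToLongFunction;

// Hypothetical helper, not part of the documented API.
final class PlanFileGroupsSketch {
  private PlanFileGroupsSketch() {}

  // Keeps only tasks accepted by shouldRewrite, then packs them in encounter
  // order into groups whose total size stays within maxGroupSize. A single
  // task larger than the limit still gets a group of its own.
  static <T> List<List<T>> plan(
      Iterable<T> tasks,
      Predicate<T> shouldRewrite,
      ToLongFunction<T> sizeOf,
      long maxGroupSize) {
    List<List<T>> groups = new ArrayList<>();
    List<T> current = new ArrayList<>();
    long currentSize = 0L;

    for (T task : tasks) {
      if (!shouldRewrite.test(task)) {
        continue; // not a rewrite candidate, so the file is left untouched
      }
      long size = sizeOf.applyAsLong(task);
      // Close the current group when adding this task would exceed the limit.
      if (!current.isEmpty() && currentSize + size > maxGroupSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentSize = 0L;
      }
      current.add(task);
      currentSize += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```

        With file sizes 10, 600, 20, 30 and 50, a predicate that selects sizes below 100, and a limit of 50, this sketch yields the groups [10, 20], [30] and [50]; the 600-sized file is skipped.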
      • rewrite

        java.util.Set<F> rewrite(java.util.List<T> group)
        Rewrites a group of files represented by the given list of scan tasks.

        The implementation is expected to be engine-specific (e.g. Spark, Flink, Trino).

        Parameters:
        group - a group of scan tasks for files to be rewritten together
        Returns:
        a set of newly written files
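
        How the methods fit together can be sketched with a simple sequential driver. Everything here is an assumption made for illustration: RewriterLike is a hypothetical stand-in that mirrors the documented method shapes without the ContentScanTask and ContentFile bounds, and RewriteDriverSketch is not an Iceberg class. A real engine would typically run each file group as its own job rather than in a loop.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in mirroring the documented methods, with the generic
// bounds dropped so the sketch compiles on its own.
interface RewriterLike<T, F> {
  Set<String> validOptions();

  void init(Map<String, String> options);

  Iterable<List<T>> planFileGroups(Iterable<T> tasks);

  Set<F> rewrite(List<T> group);
}

// Hypothetical driver, not part of the documented API.
final class RewriteDriverSketch {
  private RewriteDriverSketch() {}

  // Initializes the rewriter once, plans file groups per partition, rewrites
  // each group in turn, and collects the newly written files.
  static <T, F> Set<F> rewriteAll(
      RewriterLike<T, F> rewriter,
      Map<String, String> options,
      Iterable<? extends Iterable<T>> tasksByPartition) {
    rewriter.init(options);
    Set<F> newFiles = new LinkedHashSet<>();
    for (Iterable<T> partitionTasks : tasksByPartition) {
      for (List<T> group : rewriter.planFileGroups(partitionTasks)) {
        newFiles.addAll(rewriter.rewrite(group));
      }
    }
    return newFiles;
  }
}
```

        Planning per partition keeps every file group within a single partition, which matches the breakdown described in the interface overview.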