Class SizeBasedFileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>

java.lang.Object
org.apache.iceberg.actions.SizeBasedFileRewriter<T,F>
All Implemented Interfaces:
FileRewriter<T,F>
Direct Known Subclasses:
SizeBasedDataRewriter, SizeBasedPositionDeletesRewriter

public abstract class SizeBasedFileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>> extends Object implements FileRewriter<T,F>
A file rewriter that determines which files to rewrite based on their size.

Files smaller than the MIN_FILE_SIZE_BYTES threshold or larger than the MAX_FILE_SIZE_BYTES threshold are considered candidates for rewriting.

Once selected, files are grouped using a bin-packing algorithm into groups of no more than MAX_FILE_GROUP_SIZE_BYTES. A group will actually be rewritten if it contains more than MIN_INPUT_FILES files or if it would produce at least one file of TARGET_FILE_SIZE_BYTES.

Note that implementations may add extra conditions for selecting files or filtering groups.
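
The selection and group-filtering rules described above can be sketched as follows. This is a minimal illustration, not the class's actual code: `SizeRulesSketch` and its method and parameter names are hypothetical, thresholds are passed in directly rather than read from options, and the strict `>` and inclusive `>=` comparisons simply mirror the wording "more than" and "at least" above.

```java
import java.util.List;

// Hypothetical sketch of the size-based rules; not the Iceberg implementation.
class SizeRulesSketch {
  // A file is a rewrite candidate when it falls outside [minFileSize, maxFileSize].
  static boolean wronglySized(long fileSize, long minFileSize, long maxFileSize) {
    return fileSize < minFileSize || fileSize > maxFileSize;
  }

  // A group qualifies when it has more than minInputFiles files
  // or holds enough data to produce at least one target-sized file.
  static boolean shouldRewrite(List<Long> fileSizes, int minInputFiles, long targetFileSize) {
    long totalSize = fileSizes.stream().mapToLong(Long::longValue).sum();
    boolean enoughInputFiles = fileSizes.size() > minInputFiles;
    boolean enoughContent = totalSize >= targetFileSize;
    return enoughInputFiles || enoughContent;
  }

  public static void main(String[] args) {
    // Sizes are in MB for readability (assumed example values).
    System.out.println(wronglySized(10, 384, 921));
    System.out.println(shouldRewrite(List.of(300L, 300L), 5, 512));
  }
}
```

Subclasses may layer further conditions on top of these checks, as the note above says.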

  • Field Details

    • TARGET_FILE_SIZE_BYTES

      public static final String TARGET_FILE_SIZE_BYTES
      The target output file size that this file rewriter will attempt to generate.
    • MIN_FILE_SIZE_BYTES

      public static final String MIN_FILE_SIZE_BYTES
      Controls which files will be considered for rewriting. Files with sizes under this threshold will be considered for rewriting regardless of any other criteria.

      Defaults to 75% of the target file size.

    • MIN_FILE_SIZE_DEFAULT_RATIO

      public static final double MIN_FILE_SIZE_DEFAULT_RATIO
    • MAX_FILE_SIZE_BYTES

      public static final String MAX_FILE_SIZE_BYTES
      Controls which files will be considered for rewriting. Files with sizes above this threshold will be considered for rewriting regardless of any other criteria.

      Defaults to 180% of the target file size.

    • MAX_FILE_SIZE_DEFAULT_RATIO

      public static final double MAX_FILE_SIZE_DEFAULT_RATIO
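
As a worked example of the default thresholds, the sketch below derives the minimum and maximum file sizes from a target size using the 75% and 180% figures stated above. The class name, method names, and the 512 MB target are assumptions made for illustration; the real rewriter resolves these values from its options.

```java
// Hypothetical sketch: deriving default size thresholds from a target file size.
class DefaultThresholdsSketch {
  // Ratios taken from the documented defaults (75% and 180% of the target).
  static final double MIN_RATIO = 0.75;
  static final double MAX_RATIO = 1.80;

  static long defaultMinFileSize(long targetFileSize) {
    return (long) (targetFileSize * MIN_RATIO);
  }

  static long defaultMaxFileSize(long targetFileSize) {
    return (long) (targetFileSize * MAX_RATIO);
  }

  public static void main(String[] args) {
    long target = 512L * 1024 * 1024; // assumed 512 MB target
    System.out.println("min = " + defaultMinFileSize(target));
    System.out.println("max = " + defaultMaxFileSize(target));
  }
}
```

With a 512 MB target, files under 384 MB or over roughly 921 MB would be treated as wrongly sized.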
    • MIN_INPUT_FILES

      public static final String MIN_INPUT_FILES
      Any file group exceeding this number of files will be rewritten regardless of other criteria. This config ensures file groups that contain many files are compacted even if the total size of that group is less than the target file size. This can also be thought of as the maximum number of wrongly sized files that could remain in a partition after rewriting.
    • MIN_INPUT_FILES_DEFAULT

      public static final int MIN_INPUT_FILES_DEFAULT
    • REWRITE_ALL

      public static final String REWRITE_ALL
      Overrides other options and forces rewriting of all provided files.
    • REWRITE_ALL_DEFAULT

      public static final boolean REWRITE_ALL_DEFAULT
    • MAX_FILE_GROUP_SIZE_BYTES

      public static final String MAX_FILE_GROUP_SIZE_BYTES
      This option controls the largest amount of data that should be rewritten in a single file group. It helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. For example, a sort-based rewrite may not scale to TB-sized partitions, and those partitions need to be worked on in small subsections to avoid exhaustion of resources.
    • MAX_FILE_GROUP_SIZE_BYTES_DEFAULT

      public static final long MAX_FILE_GROUP_SIZE_BYTES_DEFAULT
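
To illustrate how a group size cap such as MAX_FILE_GROUP_SIZE_BYTES breaks a large partition into smaller units of work, here is a minimal sketch of a greedy packing pass. The packing strategy shown (fill the current group until the next file would exceed the cap) is an assumption chosen for simplicity; this page only states that bin-packing is used, and `GroupPackingSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: greedy grouping of file sizes under a maximum group size.
class GroupPackingSketch {
  static List<List<Long>> pack(List<Long> fileSizes, long maxGroupSize) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0;
    for (long size : fileSizes) {
      // Close the current group when adding this file would exceed the cap.
      if (!current.isEmpty() && currentSize + size > maxGroupSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentSize = 0;
      }
      // A file that alone reaches or exceeds the cap ends up in a group by itself.
      current.add(size);
      currentSize += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  public static void main(String[] args) {
    // Sizes in arbitrary units with a cap of 100 (assumed example values).
    System.out.println(pack(List.of(40L, 40L, 40L, 100L, 10L), 100));
  }
}
```

Each resulting group can then be rewritten independently, which keeps the resource needs of any single rewrite bounded.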
  • Constructor Details

    • SizeBasedFileRewriter

      protected SizeBasedFileRewriter(Table table)
  • Method Details

    • defaultTargetFileSize

      protected abstract long defaultTargetFileSize()
    • filterFiles

      protected abstract Iterable<T> filterFiles(Iterable<T> tasks)
    • filterFileGroups

      protected abstract Iterable<List<T>> filterFileGroups(List<List<T>> groups)
    • table

      protected Table table()
    • validOptions

      public Set<String> validOptions()
      Description copied from interface: FileRewriter
      Returns a set of supported options for this rewriter. Only options specified in this list will be accepted at runtime. Any other options will be rejected.
      Specified by:
      validOptions in interface FileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>
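
The accept-or-reject behavior described for validOptions can be sketched as a simple set difference. This is only an illustration of the idea: `OptionCheckSketch`, `unsupportedKeys`, and the option key strings used below are assumptions, and this page does not say where or how the actual check is performed.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: finding option keys that a rewriter would not accept.
class OptionCheckSketch {
  // Returns the keys in options that are absent from validOptions, in sorted order.
  static Set<String> unsupportedKeys(Map<String, String> options, Set<String> validOptions) {
    Set<String> unsupported = new TreeSet<>(options.keySet());
    unsupported.removeAll(validOptions);
    return unsupported;
  }

  public static void main(String[] args) {
    // Key names are assumed for illustration only.
    Set<String> valid = Set.of("target-file-size-bytes", "min-input-files");
    Map<String, String> options = Map.of("min-input-files", "5", "unknown-option", "x");
    Set<String> rejected = unsupportedKeys(options, valid);
    if (!rejected.isEmpty()) {
      System.out.println("Unsupported options: " + rejected);
    }
  }
}
```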
    • init

      public void init(Map<String,String> options)
      Description copied from interface: FileRewriter
      Initializes this rewriter using provided options.
      Specified by:
      init in interface FileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>
      Parameters:
      options - options to initialize this rewriter
    • wronglySized

      protected boolean wronglySized(T task)
    • planFileGroups

      public Iterable<List<T>> planFileGroups(Iterable<T> tasks)
      Description copied from interface: FileRewriter
      Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups. The file groups are then rewritten in a single executable unit, such as a Spark job.
      Specified by:
      planFileGroups in interface FileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>
      Parameters:
      tasks - an iterable of scan tasks for files in a partition
      Returns:
      groups of scan tasks for files to be rewritten in a single executable unit
    • enoughInputFiles

      protected boolean enoughInputFiles(List<T> group)
    • enoughContent

      protected boolean enoughContent(List<T> group)
    • tooMuchContent

      protected boolean tooMuchContent(List<T> group)
    • inputSize

      protected long inputSize(List<T> group)
    • splitSize

      protected long splitSize(long inputSize)
      Calculates the split size to use in bin-packing rewrites.

      This method determines the target split size as the input size divided by the desired number of output files. The final split size is adjusted to be at least as big as the target file size but less than the max write file size.
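
The calculation described above amounts to a clamp, which the following sketch illustrates. It assumes the number of output files, the target file size, and the max write file size are already known and passes them as parameters; the class and parameter names are hypothetical, and any extra per-split overhead the real method may add is not modeled. The upper bound is treated as inclusive here, which is also an assumption.

```java
// Hypothetical sketch: clamping the split size between the target and max write sizes.
class SplitSizeSketch {
  static long splitSize(long inputSize, long numOutputFiles, long targetFileSize, long writeMaxFileSize) {
    // Start from an even division of the input across the desired output files.
    long estimated = inputSize / numOutputFiles;
    // Never go below the target file size, never above the max write file size.
    return Math.min(Math.max(estimated, targetFileSize), writeMaxFileSize);
  }

  public static void main(String[] args) {
    // Sizes in MB (assumed example values).
    System.out.println(splitSize(10_100, 10, 1_000, 1_400));
  }
}
```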

    • numOutputFiles

      protected long numOutputFiles(long inputSize)
      Determines the preferable number of output files when rewriting a particular file group.

      If the rewriter is handling 10.1 GB of data with a target file size of 1 GB, it could produce 11 files, one of which would only have 0.1 GB. This would most likely be less preferable to 10 files with 1.01 GB each. So this method decides whether to round up or round down based on what the estimated average file size will be if the remainder (0.1 GB) is distributed amongst other files. If the new average file size is no more than 10% greater than the target file size, then this method will round down when determining the number of output files. Otherwise, the remainder will be written into a separate file.

      Parameters:
      inputSize - the total input size of a file group
      Returns:
      the number of files this rewriter should create
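
The rounding decision described for numOutputFiles can be sketched as follows, covering only the 10% rule stated above. `OutputFileCountSketch` is a hypothetical name, the target size is passed as a parameter, and returning a single file when the input is smaller than the target is an assumption added to keep the arithmetic well defined. Any additional checks in the actual method are not reproduced.

```java
// Hypothetical sketch: choosing between rounding the output file count down or up.
class OutputFileCountSketch {
  static long numOutputFiles(long inputSize, long targetFileSize) {
    // Assumption: input smaller than one target file yields a single output file.
    if (inputSize < targetFileSize) {
      return 1;
    }
    long roundedDown = inputSize / targetFileSize;
    long roundedUp = roundedDown + (inputSize % targetFileSize == 0 ? 0 : 1);
    // Average file size if the remainder is spread across the rounded-down count.
    long avgIfRoundedDown = inputSize / roundedDown;
    // Round down when that average is no more than 10% above the target
    // (integer form of avg <= 1.1 * target).
    boolean withinTenPercent = avgIfRoundedDown * 10 <= targetFileSize * 11;
    return withinTenPercent ? roundedDown : roundedUp;
  }

  public static void main(String[] args) {
    // 10.1 GB of input with a 1 GB target, expressed in MB with 1 GB taken as 1000 MB.
    System.out.println(numOutputFiles(10_100, 1_000));
  }
}
```

For the 10.1 GB example, the average after rounding down is 1.01 GB, which is within 10% of the target, so the sketch settles on 10 files.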
    • writeMaxFileSize

      protected long writeMaxFileSize()
      Estimates a larger max target file size than the target size used in task creation to avoid creating tiny remainder files.

      While we create tasks that should all be smaller than our target size, the actual data may end up larger than the target because of factors outside our control, such as compression and serialization. If this occurs, instead of making a single file that is close in size to our target, we would end up producing one file of the target size and then a small extra file with the remaining data.

      For example, if our target is 512 MB, we may generate a rewrite task that should be 500 MB. When we write the data we may find we actually have to write out 530 MB. If we use the target size while writing, we would produce a 512 MB file and an 18 MB file. If instead we use a larger size estimated by this method, then we end up writing a single file.

      Returns:
      the target size plus one half of the distance between max and target
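
The return value described above, together with the 512 MB example, can be checked with a short sketch. The names are hypothetical, sizes are in MB, and the 921 MB maximum is an assumed value obtained by applying the 180% default to a 512 MB target and truncating.

```java
// Hypothetical sketch: the write limit sits halfway between the target and max file sizes.
class WriteMaxFileSizeSketch {
  static long writeMaxFileSize(long targetFileSize, long maxFileSize) {
    return targetFileSize + (maxFileSize - targetFileSize) / 2;
  }

  // Number of files produced when data is rolled over at the given size limit.
  static long filesWritten(long dataSize, long sizeLimit) {
    return (dataSize + sizeLimit - 1) / sizeLimit; // ceiling division
  }

  public static void main(String[] args) {
    long target = 512; // MB
    long max = 921;    // MB, assumed: 180% of the target, truncated
    long writeMax = writeMaxFileSize(target, max);
    System.out.println("files with target limit: " + filesWritten(530, target));
    System.out.println("files with write max limit: " + filesWritten(530, writeMax));
  }
}
```

Under these assumptions, writing 530 MB against the 512 MB target splits the data into two files, while the larger write limit keeps it in one.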
    • outputSpec

      protected PartitionSpec outputSpec()
    • outputSpecId

      protected int outputSpecId()