Class BinPackStrategy

    • Constructor Summary

      Constructors 
      Constructor Description
      BinPackStrategy()  
    • Method Summary

      Modifier and Type Method Description
      protected long inputFileSize​(java.util.List<FileScanTask> fileToRewrite)  
      java.lang.String name()
      Returns the name of this rewrite strategy
      protected long numOutputFiles​(long totalSizeInBytes)
      Determine how many output files to create when rewriting.
      RewriteStrategy options​(java.util.Map<java.lang.String,​java.lang.String> options)
      Sets options to be used with this strategy
      java.lang.Iterable<java.util.List<FileScanTask>> planFileGroups​(java.lang.Iterable<FileScanTask> dataFiles)
      Groups file scans into lists which will be processed in a single executable unit.
      java.lang.Iterable<FileScanTask> selectFilesToRewrite​(java.lang.Iterable<FileScanTask> dataFiles)
      Selects files which this strategy believes are valid targets to be rewritten.
      protected long splitSize​(long totalSizeInBytes)
      Returns the smaller of the max write file threshold and the estimated split size, which is based on the number of output files we want to generate.
      protected long targetFileSize()  
      java.util.Set<java.lang.String> validOptions()
      Returns a set of options which this rewrite strategy can use.
      protected long writeMaxFileSize()
      Estimates a max write file size that is larger than the target size used in task creation. This avoids tiny remainder files when a task predicted to fit the target size exceeds it once serialization is complete.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • MIN_INPUT_FILES

        public static final java.lang.String MIN_INPUT_FILES
        The minimum number of files that need to be in a file group for it to be considered for compaction when the total size of that group is less than RewriteDataFiles.TARGET_FILE_SIZE_BYTES. This can also be thought of as the maximum number of non-target-size files that could remain in a file group (partition) after rewriting.
        See Also:
        Constant Field Values
      • MIN_INPUT_FILES_DEFAULT

        public static final int MIN_INPUT_FILES_DEFAULT
        See Also:
        Constant Field Values
      • MIN_FILE_SIZE_BYTES

        public static final java.lang.String MIN_FILE_SIZE_BYTES
        Adjusts files which will be considered for rewriting. Files smaller than MIN_FILE_SIZE_BYTES will be considered for rewriting. This functions independently of MAX_FILE_SIZE_BYTES.

        Defaults to 75% of the target file size.

        See Also:
        Constant Field Values
      • MIN_FILE_SIZE_DEFAULT_RATIO

        public static final double MIN_FILE_SIZE_DEFAULT_RATIO
        See Also:
        Constant Field Values
      • MAX_FILE_SIZE_BYTES

        public static final java.lang.String MAX_FILE_SIZE_BYTES
        Adjusts files which will be considered for rewriting. Files larger than MAX_FILE_SIZE_BYTES will be considered for rewriting. This functions independently of MIN_FILE_SIZE_BYTES.

        Defaults to 180% of the target file size.

        See Also:
        Constant Field Values
      • MAX_FILE_SIZE_DEFAULT_RATIO

        public static final double MAX_FILE_SIZE_DEFAULT_RATIO
        See Also:
        Constant Field Values
      • DELETE_FILE_THRESHOLD

        public static final java.lang.String DELETE_FILE_THRESHOLD
        The minimum number of deletes that need to be associated with a data file for it to be considered for rewriting. If a data file has this number of deletes or more, it will be rewritten regardless of the file size bounds set by MIN_FILE_SIZE_BYTES and MAX_FILE_SIZE_BYTES. If a file group contains a file that satisfies this condition, the file group will be rewritten regardless of the minimum number of files in the group set by MIN_INPUT_FILES.

        Defaults to Integer.MAX_VALUE, which means this feature is not enabled by default.

        See Also:
        Constant Field Values
      • DELETE_FILE_THRESHOLD_DEFAULT

        public static final int DELETE_FILE_THRESHOLD_DEFAULT
        See Also:
        Constant Field Values
      • REWRITE_ALL

        public static final java.lang.String REWRITE_ALL
        Rewrites all files, regardless of their size. Defaults to false, which rewrites only mis-sized files.
        See Also:
        Constant Field Values
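
The size bounds, the delete threshold, and the rewrite-all flag described above combine into a per-file decision. The following is a minimal sketch of that decision under stated assumptions, not the library's implementation: the class `FileSelectionSketch`, the method `shouldRewrite`, and all of its parameters are hypothetical names, and the 0.75 and 1.80 ratios are simply the documented defaults (75% and 180% of the target file size).

```java
// Hypothetical sketch; none of these names belong to the Iceberg API.
final class FileSelectionSketch {
    // Documented defaults: 75% and 180% of the target file size.
    static final double MIN_RATIO = 0.75;
    static final double MAX_RATIO = 1.80;

    /** Returns true if a single data file would be a rewrite candidate. */
    static boolean shouldRewrite(
            long fileSizeBytes,
            int deleteCount,
            long targetFileSizeBytes,
            int deleteThreshold,
            boolean rewriteAll) {
        long minSize = (long) (targetFileSizeBytes * MIN_RATIO);
        long maxSize = (long) (targetFileSizeBytes * MAX_RATIO);

        // The two size bounds act independently of each other.
        boolean misSized = fileSizeBytes < minSize || fileSizeBytes > maxSize;

        // "This number of deletes or more" triggers a rewrite regardless of size.
        boolean tooManyDeletes = deleteCount >= deleteThreshold;

        return rewriteAll || misSized || tooManyDeletes;
    }
}
```

With a delete threshold of `Integer.MAX_VALUE` (the documented default), the delete check can effectively never fire, which matches the statement that the feature is disabled by default.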
    • Constructor Detail

      • BinPackStrategy

        public BinPackStrategy()
    • Method Detail

      • name

        public java.lang.String name()
        Description copied from interface: RewriteStrategy
        Returns the name of this rewrite strategy
        Specified by:
        name in interface RewriteStrategy
      • validOptions

        public java.util.Set<java.lang.String> validOptions()
        Description copied from interface: RewriteStrategy
        Returns a set of options which this rewrite strategy can use. This is an allowlist: any option not listed here will be rejected at runtime.
        Specified by:
        validOptions in interface RewriteStrategy
      • selectFilesToRewrite

        public java.lang.Iterable<FileScanTask> selectFilesToRewrite​(java.lang.Iterable<FileScanTask> dataFiles)
        Description copied from interface: RewriteStrategy
        Selects files which this strategy believes are valid targets to be rewritten.
        Specified by:
        selectFilesToRewrite in interface RewriteStrategy
        Parameters:
        dataFiles - iterable of FileScanTasks for files in a given partition
        Returns:
        iterable containing only FileScanTasks to be rewritten
      • planFileGroups

        public java.lang.Iterable<java.util.List<FileScanTask>> planFileGroups​(java.lang.Iterable<FileScanTask> dataFiles)
        Description copied from interface: RewriteStrategy
        Groups file scans into lists which will be processed in a single executable unit. Each group will end up being committed as an independent set of changes. This creates the jobs which will eventually be run by the underlying Action.
        Specified by:
        planFileGroups in interface RewriteStrategy
        Parameters:
        dataFiles - iterable of FileScanTasks to be rewritten
        Returns:
        iterable of lists of FileScanTasks which will be processed together
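
The descriptions of MIN_INPUT_FILES and DELETE_FILE_THRESHOLD imply a filter on planned groups. The sketch below illustrates one reading of those descriptions. It is an assumption-laden illustration rather than the actual planning code: `GroupFilterSketch`, `shouldRewriteGroup`, and the array-based parameters are hypothetical, and the exact boundary conditions in the real implementation may differ.

```java
// Hypothetical sketch; none of these names belong to the Iceberg API.
final class GroupFilterSketch {
    /**
     * Decides whether a planned file group is worth rewriting.
     * fileSizes[i] and deleteCounts[i] describe the i-th file of the group.
     */
    static boolean shouldRewriteGroup(
            long[] fileSizes,
            int[] deleteCounts,
            long targetFileSizeBytes,
            int minInputFiles,
            int deleteThreshold) {
        long totalSize = 0L;
        for (long size : fileSizes) {
            totalSize += size;
        }

        // The file-count minimum only matters while the group is smaller than the target.
        boolean enoughFiles = fileSizes.length >= minInputFiles;
        boolean enoughData = totalSize >= targetFileSizeBytes;

        // A single file at or above the delete threshold forces the group through.
        boolean tooManyDeletes = false;
        for (int deletes : deleteCounts) {
            if (deletes >= deleteThreshold) {
                tooManyDeletes = true;
                break;
            }
        }

        return enoughFiles || enoughData || tooManyDeletes;
    }
}
```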
      • targetFileSize

        protected long targetFileSize()
      • numOutputFiles

        protected long numOutputFiles​(long totalSizeInBytes)
        Determine how many output files to create when rewriting. We use this to determine the split size to use when actually writing files, in order to avoid the following situation.

        If we are writing 10.1 GB of data with a target file size of 1 GB, we would end up with 11 files, one of which would hold only 0.1 GB. This is most likely less preferable than 10 files of 1.01 GB each. So here we decide whether to round up or round down based on what the estimated average file size will be if we ignore the remainder (0.1 GB). If that average file size is less than 10% greater than the target file size, we round down when determining the number of output files.

        Parameters:
        totalSizeInBytes - total data size for a file group
        Returns:
        the number of files this strategy should create
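
The rounding rule can be sketched as follows. This is a simplified illustration based only on the description above; the actual method may apply further checks. The class `OutputFileCountSketch` and the explicit `targetFileSizeBytes` parameter are hypothetical.

```java
// Hypothetical sketch; none of these names belong to the Iceberg API.
final class OutputFileCountSketch {
    /** Picks the number of output files for a group of the given total size. */
    static long numOutputFiles(long totalSizeInBytes, long targetFileSizeBytes) {
        if (totalSizeInBytes < targetFileSizeBytes) {
            return 1L; // everything fits into a single file
        }

        long roundedDown = totalSizeInBytes / targetFileSizeBytes;
        if (totalSizeInBytes % targetFileSizeBytes == 0) {
            return roundedDown; // no remainder, nothing to decide
        }
        long roundedUp = roundedDown + 1;

        // Average file size if the remainder is spread over the rounded-down count.
        long avgIfRoundedDown = totalSizeInBytes / roundedDown;

        // Round down only if files would grow by less than 10% over the target.
        return avgIfRoundedDown < 1.1 * targetFileSizeBytes ? roundedDown : roundedUp;
    }
}
```

For the 10.1 GB example with a 1 GB target, the average size when rounding down is about 1.01 GB, which is under the 1.1 GB limit, so the sketch returns 10 files instead of 11.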
      • splitSize

        protected long splitSize​(long totalSizeInBytes)
        Returns the smaller of the max write file threshold and the estimated split size, which is based on the number of output files we want to generate. A small overhead is added to the estimated split size so that small errors in size estimation do not create brand-new files.
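
As a rough illustration of this calculation, the sketch below takes the output file count and the max write file size as plain arguments. All names here are hypothetical, and the overhead value is passed in by the caller because this page does not state the value the library uses.

```java
// Hypothetical sketch; none of these names belong to the Iceberg API.
final class SplitSizeSketch {
    /**
     * Estimates a split size: the even share per output file plus a small
     * overhead, capped at the max write file size.
     */
    static long splitSize(
            long totalSizeInBytes,
            long numOutputFiles,
            long overheadBytes,
            long writeMaxFileSizeBytes) {
        long estimatedSplitSize = totalSizeInBytes / numOutputFiles + overheadBytes;
        return Math.min(estimatedSplitSize, writeMaxFileSizeBytes);
    }
}
```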
      • inputFileSize

        protected long inputFileSize​(java.util.List<FileScanTask> fileToRewrite)
      • writeMaxFileSize

        protected long writeMaxFileSize()
        Estimates a max write file size that is larger than the target size used in task creation. This avoids tiny remainder files when a task predicted to fit the target size exceeds it once serialization is complete.

        While we create tasks that should all be smaller than our target size, there is a chance that the actual data will end up being larger than our target size due to compression, serialization, and other factors outside our control. If this occurs, instead of making a single file that is close in size to our target, we would end up producing one file of the target size and then a small extra file with the remaining data. For example, if our target is 512 MB we may generate a rewrite task that should be 500 MB. When we write the data we may find we actually have to write out 530 MB. If we use the target size while writing, we would produce a 512 MB file and an 18 MB file. If instead we use a larger size estimated by this method, then we end up writing a single file.

        Returns:
        the target size plus one half of the distance between max and target
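
The documented return value, the target size plus one half of the distance between max and target, can be sketched as below. The class and parameter names are hypothetical, and integer arithmetic is an assumption of this sketch; the library may round differently.

```java
// Hypothetical sketch; none of these names belong to the Iceberg API.
final class WriteMaxFileSizeSketch {
    /** Returns the target size plus half the distance between max and target. */
    static long writeMaxFileSize(long targetFileSizeBytes, long maxFileSizeBytes) {
        return targetFileSizeBytes + (maxFileSizeBytes - targetFileSizeBytes) / 2;
    }
}
```

With a 512 MB target and the default max of 180% of the target (about 921.6 MB), this yields roughly 716.8 MB, so the 530 MB task from the example above would be written as a single file.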