Class SizeBasedFileRewriter<T extends ContentScanTask<F>,​F extends ContentFile<F>>

    • Method Summary

      Modifier and Type Method Description
      protected abstract long defaultTargetFileSize()  
      protected boolean enoughContent​(java.util.List<T> group)  
      protected boolean enoughInputFiles​(java.util.List<T> group)  
      protected abstract java.lang.Iterable<java.util.List<T>> filterFileGroups​(java.util.List<java.util.List<T>> groups)  
      protected abstract java.lang.Iterable<T> filterFiles​(java.lang.Iterable<T> tasks)  
      void init​(java.util.Map<java.lang.String,​java.lang.String> options)
      Initializes this rewriter using provided options.
      protected long inputSize​(java.util.List<T> group)  
      protected long numOutputFiles​(long inputSize)
      Determines the preferable number of output files when rewriting a particular file group.
      protected PartitionSpec outputSpec()  
      protected int outputSpecId()  
      java.lang.Iterable<java.util.List<T>> planFileGroups​(java.lang.Iterable<T> tasks)
      Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups.
      protected long splitSize​(long inputSize)
      Calculates the split size to use in bin-packing rewrites.
      protected Table table()  
      protected boolean tooMuchContent​(java.util.List<T> group)  
      java.util.Set<java.lang.String> validOptions()
      Returns a set of supported options for this rewriter.
      protected long writeMaxFileSize()
      Estimates a larger max target file size than the target size used in task creation to avoid creating tiny remainder files.
      protected boolean wronglySized​(T task)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • TARGET_FILE_SIZE_BYTES

        public static final java.lang.String TARGET_FILE_SIZE_BYTES
        The target output file size that this file rewriter will attempt to generate.
        See Also:
        Constant Field Values
      • MIN_FILE_SIZE_BYTES

        public static final java.lang.String MIN_FILE_SIZE_BYTES
        Controls which files will be considered for rewriting. Files with sizes under this threshold will be considered for rewriting regardless of any other criteria.

        Defaults to 75% of the target file size.

        See Also:
        Constant Field Values
      • MIN_FILE_SIZE_DEFAULT_RATIO

        public static final double MIN_FILE_SIZE_DEFAULT_RATIO
        See Also:
        Constant Field Values
      • MAX_FILE_SIZE_BYTES

        public static final java.lang.String MAX_FILE_SIZE_BYTES
        Controls which files will be considered for rewriting. Files with sizes above this threshold will be considered for rewriting regardless of any other criteria.

        Defaults to 180% of the target file size.

        See Also:
        Constant Field Values
      • MAX_FILE_SIZE_DEFAULT_RATIO

        public static final double MAX_FILE_SIZE_DEFAULT_RATIO
        See Also:
        Constant Field Values
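
        As a rough illustration of the minimum and maximum size options described above, the sketch below derives both thresholds from a target size and flags files that fall outside them. The class SizeThresholds and the method outsideBounds are hypothetical names used only for this example. The ratios (0.75 and 1.80) are taken from the 75% and 180% defaults stated in the descriptions, not read from the constants themselves.

```java
// Illustrative sketch only; not the actual implementation of this class.
final class SizeThresholds {
    // Assumed ratios, based on the documented defaults (75% and 180% of the target).
    static final double MIN_RATIO = 0.75;
    static final double MAX_RATIO = 1.80;

    final long minFileSize;
    final long maxFileSize;

    SizeThresholds(long targetFileSize) {
        // Derive both thresholds from the target size, truncating to whole bytes.
        this.minFileSize = (long) (targetFileSize * MIN_RATIO);
        this.maxFileSize = (long) (targetFileSize * MAX_RATIO);
    }

    // A file is a rewrite candidate when it is smaller than the minimum
    // or larger than the maximum threshold.
    boolean outsideBounds(long fileSize) {
        return fileSize < minFileSize || fileSize > maxFileSize;
    }
}
```

        Under these assumptions, a 512 MB target gives a 384 MB minimum, so a 100 MB file would be selected for rewriting while a 512 MB file would not.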
      • MIN_INPUT_FILES

        public static final java.lang.String MIN_INPUT_FILES
        Any file group exceeding this number of files will be rewritten regardless of other criteria. This config ensures file groups that contain many files are compacted even if the total size of that group is less than the target file size. This can also be thought of as the maximum number of wrongly sized files that could remain in a partition after rewriting.
        See Also:
        Constant Field Values
      • MIN_INPUT_FILES_DEFAULT

        public static final int MIN_INPUT_FILES_DEFAULT
        See Also:
        Constant Field Values
      • REWRITE_ALL

        public static final java.lang.String REWRITE_ALL
        Overrides other options and forces rewriting of all provided files.
        See Also:
        Constant Field Values
      • MAX_FILE_GROUP_SIZE_BYTES

        public static final java.lang.String MAX_FILE_GROUP_SIZE_BYTES
        This option controls the largest amount of data that should be rewritten in a single file group. It helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. For example, a sort-based rewrite may not scale to TB-sized partitions, and those partitions need to be worked on in small subsections to avoid exhaustion of resources.
        See Also:
        Constant Field Values
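
        To make the effect of a group size limit concrete, the sketch below splits a list of file sizes into groups whose totals stay within a maximum. It uses a simple greedy pass in input order, which is an assumption made for illustration and not necessarily the packing strategy this class uses. The class GroupSplitSketch and the method splitIntoGroups are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a greedy, order-preserving split by total size.
final class GroupSplitSketch {
    static List<List<Long>> splitIntoGroups(List<Long> fileSizes, long maxGroupSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            // Close the current group if adding this file would exceed the limit.
            if (!current.isEmpty() && currentSize + size > maxGroupSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            // A single file larger than the limit still gets its own group.
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

        Each resulting group can then be rewritten on its own, which keeps the amount of data handled at once bounded.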
      • MAX_FILE_GROUP_SIZE_BYTES_DEFAULT

        public static final long MAX_FILE_GROUP_SIZE_BYTES_DEFAULT
        See Also:
        Constant Field Values
    • Constructor Detail

      • SizeBasedFileRewriter

        protected SizeBasedFileRewriter​(Table table)
    • Method Detail

      • defaultTargetFileSize

        protected abstract long defaultTargetFileSize()
      • filterFiles

        protected abstract java.lang.Iterable<T> filterFiles​(java.lang.Iterable<T> tasks)
      • filterFileGroups

        protected abstract java.lang.Iterable<java.util.List<T>> filterFileGroups​(java.util.List<java.util.List<T>> groups)
      • table

        protected Table table()
      • validOptions

        public java.util.Set<java.lang.String> validOptions()
        Description copied from interface: FileRewriter
        Returns a set of supported options for this rewriter. Only options specified in this list will be accepted at runtime. Any other options will be rejected.
        Specified by:
        validOptions in interface FileRewriter<T extends ContentScanTask<F>,​F extends ContentFile<F>>
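
        The accept-or-reject behavior described above could be enforced by a caller roughly as follows. This is a minimal sketch: the class OptionCheckSketch, the method validate, and the option keys in VALID are placeholders chosen for the example, and where the check actually happens at runtime is not specified here.

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: rejects any option key outside a supported set.
final class OptionCheckSketch {
    // Placeholder keys for the example; a real caller would use validOptions().
    static final Set<String> VALID = Set.of("target-file-size-bytes", "min-input-files");

    static void validate(Map<String, String> options) {
        for (String key : options.keySet()) {
            if (!VALID.contains(key)) {
                throw new IllegalArgumentException("Unsupported option: " + key);
            }
        }
    }
}
```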
      • init

        public void init​(java.util.Map<java.lang.String,​java.lang.String> options)
        Description copied from interface: FileRewriter
        Initializes this rewriter using provided options.
        Specified by:
        init in interface FileRewriter<T extends ContentScanTask<F>,​F extends ContentFile<F>>
        Parameters:
        options - options to initialize this rewriter
      • wronglySized

        protected boolean wronglySized​(T task)
      • planFileGroups

        public java.lang.Iterable<java.util.List<T>> planFileGroups​(java.lang.Iterable<T> tasks)
        Description copied from interface: FileRewriter
        Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups. The file groups are then rewritten in a single executable unit, such as a Spark job.
        Specified by:
        planFileGroups in interface FileRewriter<T extends ContentScanTask<F>,​F extends ContentFile<F>>
        Parameters:
        tasks - an iterable of scan tasks for files in a partition
        Returns:
        groups of scan tasks for files to be rewritten in a single executable unit
      • enoughInputFiles

        protected boolean enoughInputFiles​(java.util.List<T> group)
      • enoughContent

        protected boolean enoughContent​(java.util.List<T> group)
      • tooMuchContent

        protected boolean tooMuchContent​(java.util.List<T> group)
      • inputSize

        protected long inputSize​(java.util.List<T> group)
      • splitSize

        protected long splitSize​(long inputSize)
        Calculates the split size to use in bin-packing rewrites.

        This method determines the target split size as the input size divided by the desired number of output files. The final split size is adjusted to be at least as big as the target file size but less than the max write file size.
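
        The calculation described above can be sketched as a division followed by a clamp. The class SplitSizeSketch and its parameter list are hypothetical; the real method takes only the input size and reads the other values from the rewriter's state. The sketch treats the upper bound as inclusive, which is an assumption.

```java
// Illustrative sketch only; any extra adjustment the real method applies is omitted.
final class SplitSizeSketch {
    static long splitSize(long inputSize, long numOutputFiles,
                          long targetFileSize, long writeMaxFileSize) {
        // Start from the input size divided by the desired number of output files.
        long estimated = inputSize / numOutputFiles;
        // Clamp the estimate to the range [targetFileSize, writeMaxFileSize].
        return Math.max(targetFileSize, Math.min(estimated, writeMaxFileSize));
    }
}
```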

      • numOutputFiles

        protected long numOutputFiles​(long inputSize)
        Determines the preferable number of output files when rewriting a particular file group.

        If the rewriter is handling 10.1 GB of data with a target file size of 1 GB, it could produce 11 files, one of which would hold only 0.1 GB. That is usually less desirable than 10 files of 1.01 GB each. This method therefore decides whether to round up or down based on the estimated average file size if the remainder (0.1 GB) were distributed among the other files. If that average is no more than 10% greater than the target file size, the method rounds down when determining the number of output files. Otherwise, the remainder is written to a separate file.

        Parameters:
        inputSize - the total input size of a file group
        Returns:
        the number of files this rewriter should create
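
        The rounding rule described above can be sketched as follows. This is a simplified reading of the description, not the method's actual code: the class OutputFileCountSketch is hypothetical, the target size is passed as a parameter, and returning 1 for inputs smaller than the target is an assumption made to avoid dividing by zero.

```java
// Illustrative sketch only: round down when the remainder can be absorbed
// without pushing the average file size more than 10% above the target.
final class OutputFileCountSketch {
    static long numOutputFiles(long inputSize, long targetFileSize) {
        if (inputSize < targetFileSize) {
            return 1; // assumption: a group smaller than the target yields one file
        }
        long roundedDown = inputSize / targetFileSize;
        long roundedUp = (inputSize + targetFileSize - 1) / targetFileSize;
        // Average size if the remainder is spread over the rounded-down file count.
        long avgIfRoundedDown = inputSize / roundedDown;
        return avgIfRoundedDown <= targetFileSize * 1.1 ? roundedDown : roundedUp;
    }
}
```

        With sizes in MB and 1 GB taken as 1000 MB, an input of 10100 and a target of 1000 gives 10 files, matching the example above, while an input of 1500 gives 2 files because a single 1500 MB file would exceed the 10% margin.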
      • writeMaxFileSize

        protected long writeMaxFileSize()
        Estimates a larger max target file size than the target size used in task creation to avoid creating tiny remainder files.

        While we create tasks that should all be smaller than our target size, the actual data may end up larger than the target because of factors outside our control, such as compression and serialization. If this occurs, instead of making a single file that is close in size to our target, we would end up producing one file of the target size and then a small extra file with the remaining data.

        For example, if our target is 512 MB, we may generate a rewrite task that should be 500 MB. When we write the data we may find we actually have to write out 530 MB. If we use the target size while writing, we would produce a 512 MB file and an 18 MB file. If instead we use a larger size estimated by this method, then we end up writing a single file.

        Returns:
        the target size plus one half of the distance between max and target
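
        The returned value described above (the target plus half the distance to the maximum) can be sketched with plain integer arithmetic. The class WriteMaxSketch and its parameters are hypothetical; the real method takes no arguments and uses the rewriter's configured sizes.

```java
// Illustrative sketch only: midpoint between the target size and the max size.
final class WriteMaxSketch {
    static long writeMaxFileSize(long targetFileSize, long maxFileSize) {
        // Target size plus one half of the distance between max and target.
        return targetFileSize + (maxFileSize - targetFileSize) / 2;
    }
}
```

        With sizes in MB, a 512 MB target and an assumed 920 MB maximum give a write limit of 716 MB, so the 530 MB of actual data from the example above would fit in a single file rather than a 512 MB file plus an 18 MB remainder.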
      • outputSpecId

        protected int outputSpecId()