Class SortStrategy

  • All Implemented Interfaces:
    java.io.Serializable, RewriteStrategy
    Direct Known Subclasses:
    Spark3SortStrategy

    public abstract class SortStrategy
    extends BinPackStrategy
    A rewrite strategy for data files which aims to reorder data within data files to optimally lay them out in relation to a column. For example, if the Sort strategy is used on a set of files ordered by column x that originally contains File A (x: 0 - 50), File B (x: 10 - 40) and File C (x: 30 - 60), this strategy will attempt to rewrite those files into File A' (x: 0 - 20), File B' (x: 21 - 40) and File C' (x: 41 - 60).
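The example above can be sketched in plain Java. This is only an illustration of the target layout, not the library's implementation: the class `SortLayoutSketch` and the method `splitRange` are hypothetical names, and the sketch assumes integer values spread evenly over the combined range, whereas a real rewrite sorts the rows themselves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these names are not part of the Iceberg API.
final class SortLayoutSketch {

  // Splits the inclusive range [min, max] into at most `fileCount` contiguous,
  // non-overlapping ranges, giving any remainder to the earliest ranges.
  static List<long[]> splitRange(long min, long max, int fileCount) {
    if (fileCount <= 0 || max < min) {
      throw new IllegalArgumentException("Invalid range or file count");
    }
    long total = max - min + 1;
    long base = total / fileCount;
    long remainder = total % fileCount;
    List<long[]> ranges = new ArrayList<>();
    long start = min;
    for (int i = 0; i < fileCount; i++) {
      long width = base + (i < remainder ? 1 : 0);
      if (width == 0) {
        break; // more requested files than distinct values
      }
      ranges.add(new long[] {start, start + width - 1});
      start += width;
    }
    return ranges;
  }

  public static void main(String[] args) {
    // Files A (0 - 50), B (10 - 40) and C (30 - 60) together cover 0 - 60.
    for (long[] range : splitRange(0, 60, 3)) {
      System.out.println(range[0] + " - " + range[1]);
    }
  }
}
```

For the inputs in the example, the three resulting ranges are 0 - 20, 21 - 40 and 41 - 60, so no two output files cover the same value of x.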

    Currently there is no file overlap detection, so all files are rewritten if REWRITE_ALL is true (default: false). If this property is disabled, any files that would be chosen by BinPackStrategy will be rewrite candidates.
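A minimal sketch of this selection rule follows. It assumes that "mis-sized" means a file size outside an acceptable range. The class `CandidateSelectionSketch`, the method `selectCandidates` and the `minSize`/`maxSize` bounds are hypothetical and stand in for whatever BinPackStrategy actually uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these names are not part of the Iceberg API.
final class CandidateSelectionSketch {

  // Returns the sizes of the files that would be rewrite candidates.
  // minSize and maxSize are assumed bounds describing a well-sized file.
  static List<Long> selectCandidates(
      List<Long> fileSizes, boolean rewriteAll, long minSize, long maxSize) {
    List<Long> candidates = new ArrayList<>();
    for (long size : fileSizes) {
      boolean misSized = size < minSize || size > maxSize;
      // With rewriteAll enabled every file qualifies, regardless of size.
      if (rewriteAll || misSized) {
        candidates.add(size);
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    List<Long> sizes = List.of(10L, 120L, 500L);
    System.out.println(selectCandidates(sizes, false, 100L, 200L)); // prints [10, 500]
    System.out.println(selectCandidates(sizes, true, 100L, 200L)); // prints [10, 120, 500]
  }
}
```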

    In the future other algorithms for determining files to rewrite will be provided.

    See Also:
    Serialized Form
    • Field Detail

      • REWRITE_ALL

        public static final java.lang.String REWRITE_ALL
        Rewrites all files, regardless of their size. Defaults to false, rewriting only mis-sized files.
        See Also:
        Constant Field Values
    • Constructor Detail

      • SortStrategy

        public SortStrategy()
    • Method Detail

      • sortOrder

        public SortStrategy sortOrder(SortOrder order)
        Sets the sort order to be used in this strategy when rewriting files.
        Parameters:
        order - the order to use
        Returns:
        this for method chaining
      • sortOrder

        protected SortOrder sortOrder()
      • validOptions

        public java.util.Set<java.lang.String> validOptions()
        Description copied from interface: RewriteStrategy
        Returns a set of options which this rewrite strategy can use. This is an allow-list: any option not specified here will be rejected at runtime.
        Specified by:
        validOptions in interface RewriteStrategy
        Overrides:
        validOptions in class BinPackStrategy
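The allow-list behaviour described for validOptions can be sketched as follows. This is not the library's validation code. The class `OptionAllowListSketch`, the method `checkOptions` and the option keys `option-a` and `option-b` are hypothetical placeholders.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; these names are not part of the Iceberg API.
final class OptionAllowListSketch {

  // Throws if any supplied option key is missing from the allow-list.
  static void checkOptions(Set<String> validOptions, Map<String, String> options) {
    // TreeSet keeps the rejected keys sorted so the message is deterministic.
    Set<String> invalid = new TreeSet<>(options.keySet());
    invalid.removeAll(validOptions);
    if (!invalid.isEmpty()) {
      throw new IllegalArgumentException("Unsupported options: " + invalid);
    }
  }

  public static void main(String[] args) {
    Set<String> valid = Set.of("option-a");
    checkOptions(valid, Map.of("option-a", "true")); // accepted
    try {
      checkOptions(valid, Map.of("option-b", "1"));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // prints Unsupported options: [option-b]
    }
  }
}
```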
      • planFileGroups

        public java.lang.Iterable<java.util.List<FileScanTask>> planFileGroups(java.lang.Iterable<FileScanTask> dataFiles)
        Description copied from interface: RewriteStrategy
        Groups file scans into lists which will be processed in a single executable unit. Each group will end up being committed as an independent set of changes. This creates the jobs which will eventually be run by the underlying Action.
        Specified by:
        planFileGroups in interface RewriteStrategy
        Overrides:
        planFileGroups in class BinPackStrategy
        Parameters:
        dataFiles - iterable of FileScanTasks to be rewritten
        Returns:
        iterable of lists of FileScanTasks which will be processed together
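One way to picture this grouping is a greedy pass that closes a group once adding another file would exceed a size budget. This is an assumed simplification, not necessarily how planFileGroups groups its input. The class `FileGroupSketch`, the method `planGroups` and the `maxGroupSize` parameter are hypothetical, and plain file sizes stand in for FileScanTasks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these names are not part of the Iceberg API.
final class FileGroupSketch {

  // Groups file sizes in input order. A group is closed when adding the next
  // file would push its total above maxGroupSize (an assumed budget).
  static List<List<Long>> planGroups(List<Long> fileSizes, long maxGroupSize) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0;
    for (long size : fileSizes) {
      if (!current.isEmpty() && currentSize + size > maxGroupSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentSize = 0;
      }
      current.add(size);
      currentSize += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  public static void main(String[] args) {
    // Each inner list would be processed and committed as one unit.
    System.out.println(planGroups(List.of(40L, 50L, 30L, 80L), 100L));
    // prints [[40, 50], [30], [80]]
  }
}
```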
      • validateOptions

        protected void validateOptions()