Class SparkSortStrategy

    • Field Detail

      • COMPRESSION_FACTOR

        public static final java.lang.String COMPRESSION_FACTOR
        The number of shuffle partitions and consequently the number of output files created by the Spark Sort is based on the size of the input data files used in this rewrite operation. Due to compression, the disk file sizes may not accurately represent the size of files in the output. This parameter lets the user adjust the file size used for estimating actual output data size. A factor greater than 1.0 would generate more files than we would expect based on the on-disk file size. A value less than 1.0 would create fewer files than we would expect due to the on-disk size.
        See Also:
        Constant Field Values
    • Constructor Detail

      • SparkSortStrategy

        public SparkSortStrategy​(Table table,
                                 org.apache.spark.sql.SparkSession spark)
    • Method Detail

      • table

        public Table table()
        Description copied from interface: RewriteStrategy
        Returns the table being modified by this rewrite strategy
      • validOptions

        public java.util.Set<java.lang.String> validOptions()
        Description copied from interface: RewriteStrategy
        Returns a set of options which this rewrite strategy can use. This is an allowed-list and any options not specified here will be rejected at runtime.
        Specified by:
        validOptions in interface RewriteStrategy
        Overrides:
        validOptions in class SortStrategy
      • rewriteFiles

        public java.util.Set<DataFile> rewriteFiles​(java.util.List<FileScanTask> filesToRewrite)
        Description copied from interface: RewriteStrategy
        Method which will rewrite files based on this particular RewriteStrategy's algorithm. This will most likely be Action framework specific (Spark/Presto/Flink ....).
        Parameters:
        filesToRewrite - a group of files to be rewritten together
        Returns:
        a set of newly written files
      • spark

        protected org.apache.spark.sql.SparkSession spark()
      • sortPlan

        protected org.apache.spark.sql.catalyst.plans.logical.LogicalPlan sortPlan​(org.apache.spark.sql.connector.distributions.Distribution distribution,
                                                                                   org.apache.spark.sql.connector.expressions.SortOrder[] ordering,
                                                                                   org.apache.spark.sql.catalyst.plans.logical.LogicalPlan plan,
                                                                                   org.apache.spark.sql.internal.SQLConf conf)
      • sizeEstimateMultiple

        protected double sizeEstimateMultiple()