Class BaseExpireSnapshotsSparkAction

  • All Implemented Interfaces:
    Action<ExpireSnapshots,​ExpireSnapshots.Result>, ExpireSnapshots

    public class BaseExpireSnapshotsSparkAction
    extends java.lang.Object
    implements ExpireSnapshots
    An action that performs the same operation as ExpireSnapshots but uses Spark to determine the delta in files between the pre and post-expiration table metadata. All of the same restrictions of ExpireSnapshots also apply to this action.

    This action first leverages ExpireSnapshots to expire snapshots and then uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets that contain all manifest and data files before and after the expiration. The snapshot expiration will be fully committed before any deletes are issued.

    This operation performs a shuffle so the parallelism can be controlled through 'spark.sql.shuffle.partitions'.

    Deletes are still performed locally after retrieving the results from the Spark executors.

    • Constructor Detail

      • BaseExpireSnapshotsSparkAction

        public BaseExpireSnapshotsSparkAction​(org.apache.spark.sql.SparkSession spark,
                                              Table table)
    • Method Detail

      • retainLast

        public BaseExpireSnapshotsSparkAction retainLast​(int numSnapshots)
        Description copied from interface: ExpireSnapshots
        Retains the most recent ancestors of the current snapshot.

        If a snapshot would be expired because it is older than the expiration timestamp, but is one of the numSnapshots most recent ancestors of the current state, it will be retained. This will not cause snapshots explicitly identified by id from expiring.

        Identical to ExpireSnapshots.retainLast(int)

        Specified by:
        retainLast in interface ExpireSnapshots
        Parameters:
        numSnapshots - the number of snapshots to retain
        Returns:
        this for method chaining
      • deleteWith

        public BaseExpireSnapshotsSparkAction deleteWith​(java.util.function.Consumer<java.lang.String> newDeleteFunc)
        Description copied from interface: ExpireSnapshots
        Passes an alternative delete implementation that will be used for manifests and data files.

        Manifest files that are no longer used by valid snapshots will be deleted. Data files that were deleted by snapshots that are expired will be deleted.

        If this method is not called, unnecessary manifests and data files will still be deleted.

        Identical to ExpireSnapshots.deleteWith(Consumer)

        Specified by:
        deleteWith in interface ExpireSnapshots
        Parameters:
        newDeleteFunc - a function that will be called to delete manifests and data files
        Returns:
        this for method chaining
      • expire

        public org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> expire()
        Expires snapshots and commits the changes to the table, returning a Dataset of files to delete.

        This does not delete data files. To delete data files, run execute().

        This may be called before or after execute() is called to return the expired file list.

        Returns:
        a Dataset of files that are no longer referenced by the table
      • spark

        protected org.apache.spark.sql.SparkSession spark()
      • sparkContext

        protected org.apache.spark.api.java.JavaSparkContext sparkContext()
      • option

        public ThisT option​(java.lang.String name,
                            java.lang.String value)
        Description copied from interface: Action
        Configures this action with an extra option.

        Certain actions allow users to control internal details of their execution via options.

        Specified by:
        option in interface Action<ThisT,​R>
        Parameters:
        name - an option name
        value - an option value
        Returns:
        this for method chaining
      • options

        public ThisT options​(java.util.Map<java.lang.String,​java.lang.String> newOptions)
        Description copied from interface: Action
        Configures this action with extra options.

        Certain actions allow users to control internal details of their execution via options.

        Specified by:
        options in interface Action<ThisT,​R>
        Parameters:
        newOptions - a map of extra options
        Returns:
        this for method chaining
      • options

        protected java.util.Map<java.lang.String,​java.lang.String> options()
      • withJobGroupInfo

        protected <T> T withJobGroupInfo​(JobGroupInfo info,
                                         java.util.function.Supplier<T> supplier)
      • newJobGroupInfo

        protected JobGroupInfo newJobGroupInfo​(java.lang.String groupId,
                                               java.lang.String desc)
      • buildValidDataFileDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidDataFileDF​(Table table)
      • buildManifestFileDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestFileDF​(Table table)
      • buildManifestListDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF​(Table table)
      • buildOtherMetadataFileDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildOtherMetadataFileDF​(Table table)
      • buildValidMetadataFileDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidMetadataFileDF​(Table table)
      • loadMetadataTable

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable​(Table table,
                                                                                           MetadataTableType type)