Class ExpireSnapshotsSparkAction

  • All Implemented Interfaces:
    Action<ExpireSnapshots, ExpireSnapshots.Result>, ExpireSnapshots

    public class ExpireSnapshotsSparkAction
    extends java.lang.Object
    implements ExpireSnapshots
    An action that performs the same operation as ExpireSnapshots but uses Spark to determine the delta in files between the pre- and post-expiration table metadata. All of the same restrictions of ExpireSnapshots also apply to this action.

    This action first leverages ExpireSnapshots to expire snapshots and then uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets that contain all manifest and content files before and after the expiration. The snapshot expiration will be fully committed before any deletes are issued.

    This operation performs a shuffle so the parallelism can be controlled through 'spark.sql.shuffle.partitions'.

    Deletes are still performed locally after retrieving the results from the Spark executors.
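A minimal usage sketch of this action through the SparkActions entry point, assuming an already-created SparkSession and an already-loaded Iceberg Table (the loadTable() helper below is hypothetical; supply your own catalog lookup). The shuffle-partition setting is the one mentioned above:

```java
import java.util.concurrent.TimeUnit;

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.ExpireSnapshots;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class ExpireSnapshotsExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("expire-snapshots")
        // The anti-join described above shuffles; this controls its parallelism.
        .config("spark.sql.shuffle.partitions", "200")
        .getOrCreate();

    Table table = loadTable(); // hypothetical helper: obtain the Iceberg Table

    // Expire snapshots older than 7 days, but always keep the 10 most
    // recent ancestors of the current snapshot.
    long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
    ExpireSnapshots.Result result = SparkActions.get(spark)
        .expireSnapshots(table)
        .expireOlderThan(cutoff)
        .retainLast(10)
        .execute();

    System.out.println("Deleted data files: " + result.deletedDataFilesCount());
  }

  private static Table loadTable() {
    throw new UnsupportedOperationException("supply a real table lookup");
  }
}
```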

    • Method Detail

      • retainLast

        public ExpireSnapshotsSparkAction retainLast(int numSnapshots)
        Description copied from interface: ExpireSnapshots
        Retains the most recent ancestors of the current snapshot.

        If a snapshot would be expired because it is older than the expiration timestamp but is one of the numSnapshots most recent ancestors of the current state, it will be retained. This will not prevent snapshots explicitly identified by id from being expired.

        Identical to ExpireSnapshots.retainLast(int)

        Specified by:
        retainLast in interface ExpireSnapshots
        Parameters:
        numSnapshots - the number of snapshots to retain
        Returns:
        this for method chaining
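The retention rule above can be modeled in plain Java. This is an illustrative sketch, not Iceberg code: given the ancestor timestamps of the current snapshot ordered newest first, an ancestor survives if it is newer than the cutoff or sits within the numSnapshots most recent positions:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model (not Iceberg internals) of the retainLast semantics.
public class RetainLastModel {
  /** Returns the ancestor timestamps that would be retained. */
  static List<Long> retained(List<Long> ancestorsNewestFirst, long cutoff, int numSnapshots) {
    List<Long> keep = new ArrayList<>();
    for (int i = 0; i < ancestorsNewestFirst.size(); i++) {
      long ts = ancestorsNewestFirst.get(i);
      // Kept if newer than the cutoff, OR among the numSnapshots most recent.
      if (ts >= cutoff || i < numSnapshots) {
        keep.add(ts);
      }
    }
    return keep;
  }
}
```

With cutoff 50 and retainLast(2), an ancestor at timestamp 40 survives because it is the second-most-recent, while one at 30 in third position expires.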
      • deleteWith

        public ExpireSnapshotsSparkAction deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)
        Description copied from interface: ExpireSnapshots
        Passes an alternative delete implementation that will be used for manifests, data and delete files.

        Manifest files that are no longer used by valid snapshots will be deleted. Content files that were marked as logically deleted by expired snapshots will be deleted as well.

        If this method is not called, unnecessary manifests and content files will still be deleted.

        Identical to ExpireSnapshots.deleteWith(Consumer)

        Specified by:
        deleteWith in interface ExpireSnapshots
        Parameters:
        newDeleteFunc - a function that will be called to delete manifests, data files, and delete files
        Returns:
        this for method chaining
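A common use of deleteWith is a "dry run" that collects the paths the action would remove instead of removing them. A sketch, assuming the SparkSession, Table, and cutoff come from surrounding context; note that the snapshot expiration itself is still committed to the table, only the file deletion is skipped:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class DryRunExpire {
  // Returns the paths the action would have deleted, without touching storage.
  public static Iterable<String> dryRun(SparkSession spark, Table table, long cutoff) {
    ConcurrentLinkedQueue<String> removedFiles = new ConcurrentLinkedQueue<>();
    Consumer<String> collectOnly = removedFiles::add; // never deletes anything

    SparkActions.get(spark)
        .expireSnapshots(table)
        .expireOlderThan(cutoff)
        .deleteWith(collectOnly)
        .execute();

    return removedFiles;
  }
}
```

Since deletes run locally on the driver (see the class description), a plain list would also work; a concurrent collection is simply the defensive choice.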
      • expire

        public org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> expire()
        Expires snapshots and commits the changes to the table, returning a Dataset of files to delete.

        This does not delete data files. To delete data files, run execute().

        This may be called before or after execute() to return the list of expired files.

        Returns:
        a Dataset of files that are no longer referenced by the table
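A sketch of inspecting the unreferenced files before deleting them, assuming the SparkSession, Table, and cutoff are supplied by the caller. Per the description above, expire() commits the snapshot expiration and returns the file list, while execute() performs the actual deletes:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.ExpireSnapshotsSparkAction;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class InspectThenDelete {
  public static void run(SparkSession spark, Table table, long cutoff) {
    ExpireSnapshotsSparkAction action = SparkActions.get(spark)
        .expireSnapshots(table)
        .expireOlderThan(cutoff);

    // Commits the expiration and returns the now-unreferenced files,
    // but does not delete anything yet.
    Dataset<Row> toDelete = action.expire();
    toDelete.show(20, false);

    // Actually delete the files found above.
    action.execute();
  }
}
```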
      • spark

        protected org.apache.spark.sql.SparkSession spark()
      • sparkContext

        protected org.apache.spark.api.java.JavaSparkContext sparkContext()
      • option

        public ThisT option​(java.lang.String name,
                            java.lang.String value)
      • options

        public ThisT options​(java.util.Map<java.lang.String, java.lang.String> newOptions)
      • options

        protected java.util.Map<java.lang.String, java.lang.String> options()
      • withJobGroupInfo

        protected <T> T withJobGroupInfo​(JobGroupInfo info,
                                         java.util.function.Supplier<T> supplier)
      • newJobGroupInfo

        protected JobGroupInfo newJobGroupInfo​(java.lang.String groupId,
                                               java.lang.String desc)
      • buildValidContentFileWithTypeDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidContentFileWithTypeDF​(Table table)
      • buildValidContentFileDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidContentFileDF​(Table table)
      • buildManifestFileDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestFileDF​(Table table)
      • buildManifestListDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF​(Table table)
      • buildOtherMetadataFileDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildOtherMetadataFileDF​(Table table)
      • buildAllReachableOtherMetadataFileDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildAllReachableOtherMetadataFileDF​(Table table)
      • buildValidMetadataFileDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidMetadataFileDF​(Table table)
      • withFileType

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> withFileType​(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> ds,
                                                                                      java.lang.String type)
      • loadMetadataTable

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable​(Table table,
                                                                                           MetadataTableType type)