Class ExpireSnapshotsSparkAction

  • All Implemented Interfaces:
    Action<ExpireSnapshots, ExpireSnapshots.Result>, ExpireSnapshots

    public class ExpireSnapshotsSparkAction
    extends java.lang.Object
    implements ExpireSnapshots
    An action that performs the same operation as ExpireSnapshots but uses Spark to determine the delta in files between the pre- and post-expiration table metadata. All restrictions of ExpireSnapshots also apply to this action.

    This action first leverages ExpireSnapshots to expire snapshots and then uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets that contain all manifest and content files before and after the expiration. The snapshot expiration will be fully committed before any deletes are issued.

    This operation performs a shuffle, so its parallelism can be controlled through 'spark.sql.shuffle.partitions'.

    Deletes are still performed locally after retrieving the results from the Spark executors.
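
    The anti-join described above can be illustrated without Spark. The following is a minimal sketch, not this action's implementation: plain Java sets stand in for the two Datasets, and the class name ExpiredFileDiff, the method filesToDelete, and the file names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: in-memory sets stand in for the Spark Datasets of file paths.
final class ExpiredFileDiff {

  // Returns the paths that were reachable before expiration but are not reachable after it.
  // A left anti-join of "before" against "after" on the path column yields the same rows.
  static Set<String> filesToDelete(Set<String> before, Set<String> after) {
    Set<String> orphaned = new LinkedHashSet<>(before);
    orphaned.removeAll(after);
    return orphaned;
  }

  public static void main(String[] args) {
    // Hypothetical manifest and content file names.
    Set<String> before =
        new LinkedHashSet<>(List.of("m1.avro", "m2.avro", "a.parquet", "b.parquet"));
    Set<String> after = new LinkedHashSet<>(List.of("m2.avro", "b.parquet"));
    System.out.println(filesToDelete(before, after));
  }
}
```

    Only files that appear in the first set and not in the second are candidates for deletion, which is why the expiration has to be committed before the second set is computed.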

    • Field Detail

      • STREAM_RESULTS_DEFAULT

        public static final boolean STREAM_RESULTS_DEFAULT
        See Also:
        Constant Field Values
      • STATISTICS_FILES

        protected static final java.lang.String STATISTICS_FILES
        See Also:
        Constant Field Values
      • COMMA_SPLITTER

        protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER
      • COMMA_JOINER

        protected static final org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER
    • Method Detail

      • retainLast

        public ExpireSnapshotsSparkAction retainLast(int numSnapshots)
        Description copied from interface: ExpireSnapshots
        Retains the most recent ancestors of the current snapshot.

        If a snapshot would be expired because it is older than the expiration timestamp, but is one of the numSnapshots most recent ancestors of the current state, it will be retained. This will not prevent snapshots explicitly identified by id from expiring.

        Identical to ExpireSnapshots.retainLast(int)

        Specified by:
        retainLast in interface ExpireSnapshots
        Parameters:
        numSnapshots - the number of snapshots to retain
        Returns:
        this for method chaining
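
        The retention rule above can be sketched in plain Java. This is an illustration of the documented behavior under stated assumptions, not Iceberg code: the class RetainLastSketch, the Snap type, and the method idsToExpire are hypothetical, and snapshots expired explicitly by id are not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models the interaction of an expiration timestamp and retainLast.
final class RetainLastSketch {

  // Hypothetical stand-in for a snapshot: an id and a commit timestamp in milliseconds.
  static final class Snap {
    final long id;
    final long timestampMillis;

    Snap(long id, long timestampMillis) {
      this.id = id;
      this.timestampMillis = timestampMillis;
    }
  }

  // "ancestors" must be ordered newest first, starting at the current snapshot.
  static List<Long> idsToExpire(List<Snap> ancestors, long olderThanMillis, int numSnapshots) {
    List<Long> expired = new ArrayList<>();
    for (int i = 0; i < ancestors.size(); i++) {
      Snap snap = ancestors.get(i);
      boolean tooOld = snap.timestampMillis < olderThanMillis;
      boolean retained = i < numSnapshots; // one of the numSnapshots most recent ancestors
      if (tooOld && !retained) {
        expired.add(snap.id);
      }
    }
    return expired;
  }

  public static void main(String[] args) {
    List<Snap> ancestors =
        List.of(new Snap(4, 400), new Snap(3, 300), new Snap(2, 200), new Snap(1, 100));
    // Snapshot 3 is older than the cutoff but survives because it is among the last 2.
    System.out.println(idsToExpire(ancestors, 350, 2));
  }
}
```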
      • deleteWith

        public ExpireSnapshotsSparkAction deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)
        Description copied from interface: ExpireSnapshots
        Passes an alternative delete implementation that will be used for manifests, data and delete files.

        Manifest files that are no longer used by valid snapshots will be deleted. Content files that were marked as logically deleted by snapshots that are expired will be deleted as well.

        If this method is not called, unnecessary manifests and content files will still be deleted.

        Identical to ExpireSnapshots.deleteWith(Consumer)

        Specified by:
        deleteWith in interface ExpireSnapshots
        Parameters:
        newDeleteFunc - a function that will be called to delete manifests and content files
        Returns:
        this for method chaining
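
        One possible use of an alternative delete implementation is a dry run that records paths instead of removing them. The sketch below is an assumption-laden illustration: the class DryRunDelete and its methods are hypothetical, and only the Consumer<String> shape comes from the signature above. A thread-safe list is used because deletes may be issued from an executor service.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Illustrative only: a delete function that records paths rather than deleting them.
final class DryRunDelete {

  // Thread-safe, since the consumer may be invoked from several threads.
  private final List<String> recorded = new CopyOnWriteArrayList<>();

  // The returned consumer has the shape expected by deleteWith(Consumer<String>).
  Consumer<String> asDeleteFunc() {
    return recorded::add;
  }

  // Immutable copy of the paths seen so far.
  List<String> recordedPaths() {
    return List.copyOf(recorded);
  }

  public static void main(String[] args) {
    DryRunDelete dryRun = new DryRunDelete();
    Consumer<String> deleteFunc = dryRun.asDeleteFunc();
    // With a real action this would be passed as: action.deleteWith(deleteFunc)
    deleteFunc.accept("s3://bucket/table/data/a.parquet"); // hypothetical path
    System.out.println(dryRun.recordedPaths());
  }
}
```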
      • expireFiles

        public org.apache.spark.sql.Dataset<FileInfo> expireFiles()
        Expires snapshots and commits the changes to the table, returning a Dataset of files to delete.

        This does not delete data files. To delete data files, run execute().

        This may be called before or after execute() to return the expired files.

        Returns:
        a Dataset of files that are no longer referenced by the table
      • spark

        protected org.apache.spark.sql.SparkSession spark()
      • sparkContext

        protected org.apache.spark.api.java.JavaSparkContext sparkContext()
      • option

        public ThisT option(java.lang.String name,
                            java.lang.String value)
      • options

        public ThisT options(java.util.Map<java.lang.String, java.lang.String> newOptions)
      • options

        protected java.util.Map<java.lang.String, java.lang.String> options()
      • withJobGroupInfo

        protected <T> T withJobGroupInfo(JobGroupInfo info,
                                         java.util.function.Supplier<T> supplier)
      • newJobGroupInfo

        protected JobGroupInfo newJobGroupInfo(java.lang.String groupId,
                                               java.lang.String desc)
      • contentFileDS

        protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table)
      • contentFileDS

        protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table,
                                                                       java.util.Set<java.lang.Long> snapshotIds)
      • manifestDS

        protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table)
      • manifestDS

        protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table,
                                                                    java.util.Set<java.lang.Long> snapshotIds)
      • manifestListDS

        protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table)
      • manifestListDS

        protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table,
                                                                        java.util.Set<java.lang.Long> snapshotIds)
      • statisticsFileDS

        protected org.apache.spark.sql.Dataset<FileInfo> statisticsFileDS(Table table,
                                                                          java.util.Set<java.lang.Long> snapshotIds)
      • otherMetadataFileDS

        protected org.apache.spark.sql.Dataset<FileInfo> otherMetadataFileDS(Table table)
      • allReachableOtherMetadataFileDS

        protected org.apache.spark.sql.Dataset<FileInfo> allReachableOtherMetadataFileDS(Table table)
      • loadMetadataTable

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table,
                                                                                           MetadataTableType type)
      • deleteFiles

        protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(java.util.concurrent.ExecutorService executorService,
                                                                                             java.util.function.Consumer<java.lang.String> deleteFunc,
                                                                                             java.util.Iterator<FileInfo> files)
        Deletes files and keeps track of how many files were removed for each file type.
        Parameters:
        executorService - an executor service to use for parallel deletes
        deleteFunc - a function that deletes the file at the given path
        files - an iterator of FileInfo entries, each of the structure (path: String, type: String)
        Returns:
        stats on which files were deleted
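
        The general pattern of deleting in parallel while counting removals per file type can be sketched as follows. This is not the BaseSparkAction implementation: the class ParallelDeleteSketch, the PathAndType type (a stand-in for FileInfo), and the Map-based result (a stand-in for DeleteSummary) are hypothetical, and the sketch counts only deletes that complete without throwing.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Illustrative only: parallel deletes with a per-type tally.
final class ParallelDeleteSketch {

  // Hypothetical stand-in for FileInfo: a path and a file type.
  static final class PathAndType {
    final String path;
    final String type;

    PathAndType(String path, String type) {
      this.path = path;
      this.type = type;
    }
  }

  static Map<String, Long> deleteAndCount(
      ExecutorService pool, Consumer<String> deleteFunc, Iterator<PathAndType> files) {
    Map<String, Long> counts = new ConcurrentHashMap<>();
    List<Future<?>> pending = new ArrayList<>();
    while (files.hasNext()) {
      PathAndType file = files.next();
      pending.add(pool.submit(() -> {
        deleteFunc.accept(file.path);
        counts.merge(file.type, 1L, Long::sum); // reached only if the delete did not throw
      }));
    }
    // Wait for every submitted delete before reporting the tally.
    for (Future<?> task : pending) {
      try {
        task.get();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting for deletes", e);
      } catch (ExecutionException e) {
        throw new IllegalStateException("Delete failed", e.getCause());
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      // Hypothetical paths and type labels; the delete function here only prints.
      List<PathAndType> files = List.of(
          new PathAndType("data/a.parquet", "data"),
          new PathAndType("metadata/m1.avro", "manifest"));
      System.out.println(
          deleteAndCount(pool, path -> System.out.println("delete " + path), files.iterator()));
    } finally {
      pool.shutdown();
    }
  }
}
```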
      • deleteFiles

        protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io,
                                                                                             java.util.Iterator<FileInfo> files)